Vol. 4, №5, 2019
Murynin A. B., Trekin A. N., Ignatiev V. Yu., Kulchenkova V. G., Rakova K. O. Approach to enhancement of spatial resolution of the rigid objects satellite imagery // Machine Learning and Data Analysis, 2019, 4(5):296-308. doi:10.21469/22233792.4.5.01 The proposed approach makes it possible to increase the spatial resolution of satellite images of rigid objects using vector information about the geometric properties of these objects. Resolution of a low-quality image is enhanced using a probabilistic approach with optimal parameters obtained by minimizing the difference between a reference image and the result of the method on a test data set. Results of the study of spatial resolution enhancement for different types of underlying surface at different scales are presented.
Rogozin A. B. Accelerated Nesterov Method for Decentralized Distributed Optimization on Time-Varying Graphs // Machine Learning and Data Analysis, 2019, 4(5):309-315. doi:10.21469/22233792.4.5.02 The paper focuses on first-order methods in the case when the objective function changes from one iteration to another. This problem is motivated by distributed optimization on networks that can change periodically because of technical malfunctions, such as a loss of connection between two nodes. The main results of the paper include theoretical guarantees for linear convergence of distributed gradient descent and the distributed Nesterov accelerated method on strongly convex smooth objective functions under the assumption that the network has a finite number of changes.
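As an illustration of the setting (a minimal sketch under assumed data and topology, not the paper's algorithm or its convergence analysis), decentralized gradient descent on a time-varying graph alternates local gradient steps with gossip averaging over whatever edges currently exist; the Metropolis mixing rule, step size, and local least-squares objectives below are all assumptions.

```python
# Minimal sketch: decentralized gradient descent under a time-varying topology.
# Each node keeps a local copy of the variable, takes a local gradient step,
# and averages with its current neighbors via a doubly stochastic mixing
# matrix that may change between iterations.
import numpy as np

def metropolis_weights(adj):
    """Doubly stochastic mixing matrix (Metropolis rule) for an undirected graph."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

rng = np.random.default_rng(0)
n, d = 6, 3                       # nodes, variable dimension
A = rng.normal(size=(n, 5, d))    # node i holds a local least-squares term
b = rng.normal(size=(n, 5))
x = np.zeros((n, d))              # local copies x_i

for k in range(500):
    # time-varying topology: a random undirected graph at each iteration
    adj = np.triu((rng.random((n, n)) < 0.5).astype(int), 1)
    adj = adj + adj.T
    W = metropolis_weights(adj)
    grad = np.stack([A[i].T @ (A[i] @ x[i] - b[i]) for i in range(n)])
    x = W @ (x - 0.05 * grad)     # gossip averaging after local gradient steps

print("disagreement across nodes:", np.linalg.norm(x - x.mean(axis=0)))
```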
Dulin S. K., Yakushev D. A. Development of methods for filtering laser reflection points (mathematical morphological filtering, progressive filtering, segmentation) for the identification of technogenic objects // Machine Learning and Data Analysis, 2019, 4(5):316-323. doi:10.21469/22233792.4.5.03 The large amount of data obtained by laser scanning calls for effective data processing technology. The authors have developed a two-stage technology for processing spatial data (point clouds) obtained by mobile laser scanning and tied to a high-precision coordinate network, which ensures the construction of a 3D model in accordance with the established norms. It can also be applied to the results of high-resolution photogrammetric photography. The technology includes methods for rapid determination of the railway track, algorithms for the construction of individual infrastructure elements (supports of the contact network), algorithms for determining the profile of the railway, as well as methods for processing other infrastructure elements (railway facilities and devices, automation, telemechanics, and telecommunication station facilities).
Dulin S. K., Yakushev D. A. Implementation of methods of geoinformation description of technogenic objects of railway transport in the experimental software and hardware complex, providing geoinformation support for the management of the transportation process // Machine Learning and Data Analysis, 2019, 4(5):324-329. doi:10.21469/22233792.4.5.04 One of the most important results of the project was the creation of a digital railway model: a formalized mathematical and semantic description of the geometric characteristics and spatial position of the railway track and other infrastructure objects, obtained by processing geodetic measurements in a high-precision coordinate space. The digital railway model is actively used to form the optimal design position of the railway track in a single coordinate space; accordingly, it is possible to accurately compare the design position of the track with the actual data before and after repair. The fundamental point here is the exact coordinate referencing of all measurements to each other, although there are nuances. The central repository of digital railway models is the integrated railway infrastructure spatial data system, which also has tools to compare design data with actual data loaded into the system or converted by the system into the form of a digital railway model.
Vol. 4, №4, 2018
Beklaryan L. A., Beklaryan A. L. On the existence of soliton solutions for systems with a polynomial potential and their numerical realization // Machine Learning and Data Analysis, 2018, 4(4):220-234. doi:10.21469/22233792.4.4.01 The problem of existence of soliton solutions (solutions of the traveling-wave type) for the Korteweg-de Vries equation with a polynomial potential is considered on the basis of an approach that demonstrates a one-to-one correspondence between such solutions and solutions of the induced functional differential equation of pointwise type. Along this path, conditions for the existence and uniqueness of traveling-wave solutions with growth restrictions both in time and in space arise. It is very important that the conditions for the existence of a traveling-wave solution are formulated in terms of the right-hand side of the equation and the characteristics of the traveling wave, without using either linearization or the spectral properties of the corresponding equation in variations. Conditions for the existence of periodic soliton solutions are considered separately, and the possibility of transition from systems with a quasilinear potential to systems with a polynomial potential, with preservation of the corresponding existence theorems, is demonstrated. A numerical implementation of such solutions is given.
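For orientation, the classical traveling-wave reduction of the KdV equation (a standard textbook computation, not the paper's functional-differential construction) shows how the soliton ansatz turns the PDE into an ODE:

```latex
\[
  u_t + 6\,u\,u_x + u_{xxx} = 0, \qquad
  u(x,t) = \varphi(\xi), \quad \xi = x - ct,
\]
\[
  -c\,\varphi' + 6\,\varphi\,\varphi' + \varphi''' = 0
  \;\Longrightarrow\;
  \varphi'' = c\,\varphi - 3\,\varphi^{2} + A,
\]
\[
  \varphi(\xi) = \frac{c}{2}\,\operatorname{sech}^{2}\!\Bigl(\frac{\sqrt{c}}{2}\,\xi\Bigr)
  \quad \text{(the classical soliton, taking } A = 0 \text{ and } \varphi \to 0 \text{ at } \pm\infty\text{)}.
\]
```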
Murynin A. B., Richter A. A. Features of the application of methods and algorithms for the reconstruction of the three-dimensional shape of rigid objects according to the panoramic survey // Machine Learning and Data Analysis, 2018, 4(4):235-247. doi:10.21469/22233792.4.4.02 The paper discusses methods for restoring the shape of three-dimensional objects of the earth's surface using periodic features of the structure of rigid object surfaces, applicable to both satellite and panoramic images of these objects. A brief review of previously developed methods for the restoration of three-dimensional objects from one image (based on metadata, on standards, and on coordinate grids) is given. The main features of panoramic imagery and the limits of its applicability are given, and the possibility of joint use of satellite and panoramic imagery is considered. A technique based on selecting geometric periods on the surface of a rigid object and evaluating their geometric parameters is described. Using a building as an example, the main structural elements and the geometrical parameters of the object assessed during a panoramic survey are shown, and an example of reconstruction of the three-dimensional model of the object is given.
Mandrikova O. V., Zalyaev T. L., Geppener V. V., Mandrikova B. S. Analysis of the neutron monitor data and allocation of the sporadic features on the basis of neural networks and wavelet transform // Machine Learning and Data Analysis, 2018, 4(4):248-265. doi:10.21469/22233792.4.4.03 Galactic cosmic ray (GCR) observations are used in a number of fundamental and applied studies related to monitoring and forecasting space weather. The complex structure of cosmic ray data and incomplete a priori knowledge of processes in near-earth space make it difficult to construct effective methods for their analysis. The traditional spectral and averaging methods currently used make it possible to distinguish stable characteristics of cosmic ray dynamics but are ineffective for studying subtle sporadic changes. Modern global methods, such as the global survey method, make it possible to identify dynamic features in cosmic rays more accurately, but they require laborious calculations and are difficult to automate.
The present paper proposes a method and computational algorithms for the analysis of cosmic ray data and detection of sporadic effects. The method is based on the use of neural networks and the wavelet transform. Neural networks of vector quantization and a multilayer perceptron are used. The efficiency of applying vector quantization neural networks to the problem of classifying neutron monitor data in automatic mode is shown. A method for approximating the cosmic ray time course is presented, based on a multilayer perceptron neural network and the fast wavelet transform. A computational algorithm for the detailed analysis of neutron monitor data and detection of multiscale sporadic effects is described.
The results of the experiments showed the effectiveness of the proposed methods for the analysis of GCR data and the allocation of sporadic effects. The proposed method can be implemented in an automatic mode for processing the registered neutron monitor data and for an operational assessment of the GCR level, which determines its applied significance.
The results of applying various NN architectures have shown the promise of using both feedforward multilayer NNs and vector quantization NNs. In the future, the authors plan to test the constructed NN architectures on more representative statistics, expanding the number of analyzed data recording stations.
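A minimal sketch of the wavelet half of such a pipeline (illustrative only; the authors' method couples the wavelet transform with neural networks, which is not reproduced here): decompose a neutron-monitor-like series with the fast wavelet transform and flag detail coefficients exceeding a robust threshold as candidate sporadic effects. The data, wavelet, and threshold are assumptions.

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(1)
t = np.arange(1024)
signal = 100 + 2 * np.sin(2 * np.pi * t / 256) + rng.normal(0, 0.3, t.size)
signal[500:520] -= 4.0                         # injected sporadic drop

coeffs = pywt.wavedec(signal, "db4", level=4)  # fast wavelet transform
for lvl, d in enumerate(coeffs[1:], start=1):  # detail coefficients, coarse to fine
    sigma = np.median(np.abs(d)) / 0.6745      # robust noise scale (MAD)
    hits = np.where(np.abs(d) > 4 * sigma)[0]
    if hits.size:
        print(f"detail level {lvl}: anomalous coefficients at {hits}")
```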
Voznesenskaya T. V., Lednov D. A. Automatic text summarization system using a stochastic model // Machine Learning and Data Analysis, 2018, 4(4):266-279. doi:10.21469/22233792.4.4.04 This paper presents the system of automatic text summarization developed by the «DC – Systems» company in cooperation with the Faculty of Computer Science at HSE. The summary is a concise description of the text in terms of its content and meaning, i.e., from the point of view of its semantics. The purpose of summarization is to reduce the text as much as possible while maintaining the main content. A summary in this article is built using syntactically correlated word combinations; possible additional meanings of separate fragments of the text are neglected. The quality of the summary is evaluated by matching it to the source text in terms of semantics.
The main problem is split into two parts: an evaluation of the semantics of the whole text, without subdivision into parts, and the transformation of the text to derive an annotation. The architecture of the developed system and the main algorithm are described. An example of a summary derived by the system, together with its quality evaluation, is provided. The current version of the system has the following restriction: it does not handle formulas or special symbols.
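For contrast, a toy extractive summarizer (a frequency-based sketch; the described system builds summaries from syntactically correlated word combinations, which this does not attempt) scores sentences by the average frequency of their words and keeps the top-scoring ones in their original order:

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    def score(s):
        toks = re.findall(r"\w+", s.lower())
        return sum(freq[w] for w in toks) / max(len(toks), 1)
    top = sorted(sorted(sentences, key=score, reverse=True)[:n_sentences],
                 key=sentences.index)          # restore original order
    return " ".join(top)

print(summarize("Wavelets compress signals. Wavelets localize in time. "
                "Cats purr. Wavelet transforms are fast."))
```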
Ivanova A. S., Dvurechensky P. E., Gasnikov A. V. Composite optimization for the resource allocation problem // Machine Learning and Data Analysis, 2018, 4(4):280-290. doi:10.21469/22233792.4.4.05 In this paper, we consider the resource allocation problem stated as a convex minimization problem with linear constraints. To solve this problem, we use the subgradient method and gradient descent applied to the dual problem and prove convergence rates both for the primal and the dual iterates. We also provide an economic interpretation for these two methods: iterations of the algorithms naturally correspond to the process of price and production adjustment aimed at obtaining the desired production volume in the economy. Overall, we show how these actions of the economic agents lead the whole system to equilibrium.
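A minimal sketch of the economic interpretation (illustrative; the paper's methods, rates, and composite structure are not reproduced): for a separable quadratic objective with one linear constraint, the dual variable acts as a price, producers respond to the price in closed form, and dual gradient steps adjust the price until total production matches demand. All numbers are assumptions.

```python
import numpy as np

a = np.array([1.0, 3.0, 5.0])   # producers' preferred outputs (assumed)
b = 6.0                         # demanded total production volume
lam, step = 0.0, 0.5            # price and price-adjustment step

for k in range(100):
    x = a - lam / 2.0           # best response: argmin_x (x_i - a_i)^2 + lam * x_i
    lam += step * (x.sum() - b) # raise the price while production exceeds demand

print("allocation:", x, "price:", lam, "total:", x.sum())
```

Here the iterates converge to the allocation x = a - lam*/2 with the price lam* chosen so that the constraint sum(x) = b holds, mirroring the price/production adjustment described above.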
Vol. 4, №3, 2018
Kirilyuk I. L., Senko O. V. Studies of the relationship between non-stationary time series using the production functions // Machine Learning and Data Analysis, 2018, 4(3):142-151. doi:10.21469/22233792.4.3.01 False (spurious) regression occurs when the standard means of detecting patterns in models, such as the magnitude of the coefficient of determination, indicate the existence of a relationship between variables that is in reality absent.
The phenomenon of false regression was studied using the Cobb-Douglas model of production functions for time series data for regions of the Russian Federation over the period 1996-2014.
Classical methods of linear regression analysis, such as the F-test or Student's t-test, were used together with methods for estimating stationarity and cointegration of variables belonging to multidimensional time series. Additional analysis was implemented using Monte-Carlo techniques to simulate stationary time series as sequences of independent normally distributed random variables. Non-stationary time series, corresponding to unit-root autoregressive processes, were generated using independent identically distributed increments between previous and subsequent elements of the time series. Statistical validation of regression models obtained on real data sets is based on comparing their quality with the quality of models built on stationary or non-stationary multivariate time series generated under mutual independence of the variables.
It is shown that the reliability of the dependence expressed by the Cobb-Douglas function is confirmed not only when using imitation processes of Gaussian white noise, but also when imitating non-stationary autoregressive processes with a unit root.
However, applying the described method not only to the initial data but also to data with time trends subtracted showed that the effect has no significant cause other than the presence of similar linear trends in the time series associated with each factor.
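A minimal Monte-Carlo illustration of the effect (in the spirit of the validation scheme above, with illustrative sizes and no claim about the paper's exact setup): two independent unit-root random walks routinely produce a large R^2, while independent white-noise series do not.

```python
import numpy as np

rng = np.random.default_rng(2)
T, trials = 200, 1000

def r2(x, y):
    """R^2 of the simple OLS fit y ~ x."""
    return np.corrcoef(x, y)[0, 1] ** 2

r2_noise = [r2(rng.normal(size=T), rng.normal(size=T)) for _ in range(trials)]
r2_walks = [r2(np.cumsum(rng.normal(size=T)), np.cumsum(rng.normal(size=T)))
            for _ in range(trials)]

print("median R^2, independent white noise:", round(np.median(r2_noise), 3))
print("median R^2, independent random walks:", round(np.median(r2_walks), 3))
```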
Naumov V. A., Nelyubina E. A., Ryazanov V. V., Vinogradov A. P. Analysis and prediction of hydrological series based on generalized precedents // Machine Learning and Data Analysis, 2018, 4(3):152-164. doi:10.21469/22233792.4.3.02 The paper presents a new approach to the use of the apparatus of generalized precedents in problems of analysis and prediction of hydrological series. Generalized precedents are computational tools that make it possible to use, on a unified basis, various local regularities in the data that are known a priori, directly observed, or preferred for one reason or another. The main stages of the scheme for applying generalized precedents are presented, and a close relationship with the Hough transform scheme is shown. The possibilities of comparison and joint analysis of meteorological data and actual data on the volume of river flow are investigated. In this case, the generalized precedents are typical nonlinear relationships between certain hydrological parameters. The goal is to identify the differentiation of the regions of the river basin by their accumulating capabilities. We show how this can be done on the basis of an analysis of time-limited contemporary statistics. The obtained flow characteristics of the regions can be further used for short-term forecasting of river level variations and other hydrological processes and phenomena, including flood and drought situations. These characteristics can also serve as an important factor in the study of ecosystems, the geology of the region, and other similar purposes.
Lange M. M., Lange A. M. On Information Theoretical Model for Data Classification // Machine Learning and Data Analysis, 2018, 4(3):165-179. doi:10.21469/22233792.4.3.03 A data classification model based on the average mutual information between a set of objects under classification and a set of decisions about the classes of the objects is developed. Optimization of the model consists in minimizing the average mutual information over the conditional distributions of the decisions subject to a given error rate. Finding this minimum is equivalent to calculating the rate-distortion function in a scheme of coding random discrete class labels that are transformed into the appropriate objects by a continuous observation channel with known class-conditional probability densities. For classification schemes with decision rules without and with a reject option, lower bounds on the rate-distortion functions are calculated. These bounds allow us to compare the potentially attainable error rates for different sets of submitted objects and different observation channels. The theoretical results are supported by experimental error rates for face recognition within the decorrelated components of RGB images.
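The generic setup behind the model is the standard rate-distortion minimization, shown here in its textbook form with the error rate as the distortion (the paper's bounds specialize it to class labels observed through a continuous channel):

```latex
\[
  R(\varepsilon) \;=\;
  \min_{\,p(\hat{y}\,\mid\,x)\;:\;\Pr[\hat{Y} \ne Y]\,\le\,\varepsilon}
  I(X;\hat{Y}),
\]
```

that is, the smallest average mutual information between the objects and the decisions that is compatible with a given error rate ε.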
Nosova S. A., Turlapov V. E. GLCM, kNN and Meanshift for neuron detection on Nissl-stained brain slice images // Machine Learning and Data Analysis, 2018, 4(3):180-191. doi:10.21469/22233792.4.3.04 A method for neuron detection on Nissl-stained brain slice images is proposed. The method uses textural features of neurons extracted from four gray-level co-occurrence matrices (GLCMs) and includes the following steps: image preprocessing, kNN classification by the textural features, and Meanshift clustering of neuron pixels.
Preprocessing includes grayscale conversion, histogram equalization, and histogram quantization. Grayscale conversion using the blue component gives the best result. It is shown that 2- and 4-bin histograms give detection quality close to that of an 8-bin histogram (F1 = 0.83-0.85).
For pixel classification, the kNN algorithm was used. The results demonstrate that kNN is a better choice for this task than the naive Bayes classifier (NBC).
The detection quality reached by the given approach is precision = 0.82, recall = 0.92, F1 = 0.86. It is shown that these results are about the same as, or somewhat better in recall than, those of other neuron detection methods.
In future work, the authors plan to extend this investigation to larger datasets and to special datasets for important diseases.
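A minimal sketch of the feature-extraction step (illustrative; the paper's full pipeline with histogram quantization and Meanshift clustering is not reproduced, and the patches and labels below are synthetic stand-ins): GLCM texture features over small windows feed a kNN classifier.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.neighbors import KNeighborsClassifier

def glcm_features(patch):
    """Texture features from 4 GLC-matrices (one per angle)."""
    g = graycomatrix(patch, distances=[1],
                     angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                     levels=8, symmetric=True, normed=True)
    props = ("contrast", "homogeneity", "energy", "correlation")
    return np.hstack([graycoprops(g, p).ravel() for p in props])

rng = np.random.default_rng(3)
patches = rng.integers(0, 8, size=(40, 9, 9)).astype(np.uint8)  # quantized windows
labels = rng.integers(0, 2, size=40)       # stand-in neuron / background labels
X = np.array([glcm_features(p) for p in patches])

clf = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(clf.predict(X[:5]))
```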
Tleubaev A. T., Stupnikov S. A. Application of machine learning methods for subject classification of the internet domains // Machine Learning and Data Analysis, 2018, 4(3):192-214. doi:10.21469/22233792.4.3.05 The paper is devoted to the application of machine learning methods to automating the subject classification of Internet domains. The specific task is to automatically assign an Internet domain to a category from a predefined hierarchical category tree. Various classifiers that have proven themselves on highly sparse feature spaces of large dimension were used in the work. The feature spaces were formed on the basis of texts from the main pages of domains using TF-IDF and N-gram approaches. Two approaches to applying classification methods to the problem are developed: direct and multilevel. In the direct approach, a single classifier is used to predict the category of each domain; the category can be of any level in the category tree. In the multilevel approach, a set of classifiers is applied: each set of categories with a common parent has its own classifier, and classifiers are applied hierarchically, from root to leaf categories. A combination of the proposed approaches is also used. One of the practical applications of the work is user profiling based on the sites visited by users, with subsequent offering of personalized advertising.
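A minimal sketch of the direct variant (illustrative; in the multilevel variant one such classifier would be trained per parent node of the category tree, and the texts and categories below are placeholders): TF-IDF features with word n-grams feed a linear classifier of the kind that works well on large sparse feature spaces.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["buy cheap flights and hotels", "football scores and league tables",
         "hotel booking deals", "premier league transfer news"]
categories = ["travel", "sport", "travel", "sport"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),     # word unigrams and bigrams
    LogisticRegression(max_iter=1000),
)
model.fit(texts, categories)
print(model.predict(["live match results"]))
```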
Vol. 4, №2, 2018
Mandrikova O. V., Fetisova N. V., Polozov Yu. A. Modeling and analysis of natural time series on the basis of general multicomponent model // Machine Learning and Data Analysis, 2018, 4(2):74-88. doi:10.21469/22233792.4.2.01 The work is focused on the development of methods for modelling and analysis of natural time series and the construction of automated systems on their basis. The present paper proposes a general multicomponent model (GMCM) of complex time series that makes it possible to describe irregular variations in the data. The GMCM recurrent component is represented in a parametric form and describes the regular time course of the data. The GMCM anomalous components are represented by nonlinear approximating schemes and describe irregular variations. Using the example of the F2-layer ionospheric critical frequency time series (data from the world network of ionospheric stations were used), the implementation of the model is described and the results of its application are presented. A comparison with the IRI international empirical model and the median method confirmed the efficiency of the GMCM. The proposed GMCM, in contrast to analogs, makes it possible to detect anomalous changes in the data and to estimate their characteristics in automatic mode. The model is implemented numerically and is available on the Internet (http://aurorasa.ikir.ru:8580). The results of the research are important for the tasks of geophysical monitoring and operational forecasting of space weather.
Vasiliev E., Komuro T., Turlapov V., Nikolsky A. One hand aerial gesture control for AR-based cardiac interventions // Machine Learning and Data Analysis, 2018, 4(2):89-96. doi:10.21469/22233792.4.2.02 We consider the problem of interaction of the operating surgeon with medical software during operations.
We propose an interface that allows the user to interact with a computer graphics model using one-hand gestures; the operations that can be performed with one hand are moving, zooming, and rotating.
Using only one hand to interact with the model is more convenient for a person than using two hands.
We implemented three types of cursor: no pointer (suitable for touch panels, because the user's finger acts as the pointer), a simple dot (easy to create but not very descriptive, especially when moving in three dimensions), and a virtual hand (very descriptive in 3D space but difficult to create; no off-the-shelf implementation is available).
We created a demo application to show the advantages of this approach.
Chukanov S. N., Leykhter S. V. The matching of diffeomorphic images based on topological data analysis // Machine Learning and Data Analysis, 2018, 4(2):97-107. doi:10.21469/22233792.4.2.03 The paper considers the problem of matching an initial and a terminal image, solved by constructing a minimized functional that characterizes the evolution of the diffeomorphic transformation of the image from the initial to the terminal one, with a penalty for deviation of the image path from the required trajectory. The shape of the object is analyzed when recognizing object images using persistent homology methods. The shape characteristics determined by topological methods do not depend on the coordinate representation of the shape under consideration and are invariant under diffeomorphic transformations. A distinctive feature of using persistent homology, compared with other methods of algebraic topology, is that it yields more information about the shape of the object.
Pyt'ev Yu. P., Falomkina O. V., Shishkin S. A., Chulichkov A. I. Mathematical formalism for subjective modeling // Machine Learning and Data Analysis, 2018, 4(2):108-121. doi:10.21469/22233792.4.2.04 The mathematical formalism for subjective modeling (MFSM) of uncertainty, which reflects the unreliability of subjective information and the fuzziness of its content, is created. The MFSM allows the researcher-modeler (r-m) to construct models using unformalized, incomplete, and inconsistent data, ranging from "absolute ignorance" up to "complete knowledge" of the model of the research object (RO). Since "complete knowledge" of the model is equivalent to the condition of applicability of "standard" modeling, the proposed MFSM significantly generalizes "standard" mathematical modeling. If data related to the RO are available, the MFSM allows the r-m to use them to test the adequacy of the subjective model to the research objective, to correct the subjective model, and, under certain conditions, to empirically reconstruct the RO model.
Murashov D. M., Berezin A. V., Ivanova E. Yu. Painting canvas thread counting from images obtained in raking light // Machine Learning and Data Analysis, 2018, 4(2):122-135. doi:10.21469/22233792.4.2.05 This paper deals with the problem of counting painting canvas threads from images, which is necessary to determine the characteristics used by art historians for dating works of art. In the last few years, automated algorithms for calculating canvas characteristics from x-ray and high-quality terahertz images have been developed. To control fabric density in the textile industry, microscopic photographs obtained when the fabric sample is illuminated by a transmitted light source are used. The peculiarity of our research is acquiring canvas images in raking light, which emphasizes the texture of the canvas in the specified direction. For the analysis of canvas sample images, we propose modifications of a known algorithm based on filtering in the Fourier domain and thresholding, and a new algorithm based on localizing grayscale image ridges.
In known works, the number of threads is determined by the Fourier spectrum peaks or by the baselines in the canvas image. In this paper, the counting of threads is performed over all rows/columns of the image matrix, and a histogram is constructed from the results. The desired number of threads is determined by the maximum of the obtained histogram. The use of histograms reduces the inaccuracy produced by artifacts arising during image processing. For thresholding, the Otsu and Niblack methods are applied. A computing experiment studying the canvases of five portraits by Russian artists of the 18th century was carried out. The results of the experiment show the following.
The algorithm based on the Otsu method does not require parameters and has acceptable accuracy and high speed; however, on several images this algorithm gave an unacceptable result. The algorithm based on the Niblack method requires setting two parameters and is computationally more expensive than the algorithm with the global threshold method, but on average it showed higher density measurement accuracy. The measurement algorithm based on localizing grayscale image ridges requires setting more parameters and has significantly higher computational costs than the other algorithms, but it showed the best measurement accuracy, within the error limit acceptable for expertise and attribution of paintings. The researched algorithms provide canvas density measurement accuracy within one thread per centimeter for 70-97% of the sample images.
The results of the computing experiment correspond to the results of known algorithms for measuring canvas density from x-ray images of paintings. To improve the reliability of canvas density measurements in painting analysis, it is preferable to use several algorithms. Further research will be aimed at improving the accuracy and speed of the algorithms.
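A minimal sketch of the row-wise counting idea (illustrative; the Otsu/Niblack thresholding and the ridge-localization algorithm are not reproduced, and the synthetic "canvas" below has 12 vertical threads by construction): threshold the image, count thread crossings in every row, and take the mode of the per-row counts.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 12 * 2 * np.pi, 600)                # 12 thread periods
canvas = np.sin(x)[None, :] + 0.1 * rng.normal(size=(400, 600))

binary = canvas > canvas.mean()          # stand-in for a global (Otsu-like) threshold
rises = np.diff(binary.astype(int), axis=1) == 1       # rising edges = new threads
counts = rises.sum(axis=1)               # threads crossed in each row
hist = np.bincount(counts)               # histogram over all rows
print("estimated thread count:", hist.argmax())
```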
Vol. 4, №1, 2018
Voronov, A. D., Gromov, A. N., Inyakin, A. S., Zamkovoy, A. A. Verification of the expert assessments in revealing of the relevant exogenous factors affecting cargo transportation demand // Machine Learning and Data Analysis, 2018, 4(1):6-15. doi:10.21469/22233792.4.1.04 This research reveals exogenous factors that affect forecast amounts of railroad transportation in order to improve forecast relevance. The authors propose to include the influence of exogenous factors in the forecast model. Expert assessments determine the relevance of the factors. The research proposes reliability assessment methods and methods for revealing the structure and type of influence of exogenous factors on the amount of cargo transportation. The results systematize the expert reviews of the influence of exogenous factors on forecast amounts of cargo transportation. The technique of conducting an expert analysis of the importance and the type of influence on cargo transportation demand is described.
Voronov, A. D., Gromov, A. N., Inyakin, A. S., Zamkovoy, A. A. Forecasting amount of demand for cargo transportation for stationary time series // Machine Learning and Data Analysis, 2018, 4(1):16-35. doi:10.21469/22233792.4.1.05 The properties of prognostic models of the volumes of demand for freight rail transportation are investigated, with the purpose of structuring management and planning processes in freight rail transportation. The paper proposes four models for forecasting the volumes of demand for freight rail transportation, taking into account the specificity of the measured data, the business processes, and the standards of the industrial partner. In constructing the models, multivariate statistical analysis and forecasting of interdependent time series are used. The properties of the constructed models are analyzed. Forecasts are made at daily, weekly, and monthly resolution for stations and regions. The proposed prognostic models are compared by the criteria of mean absolute and mean percentage error.
Dulin, S. K., Yakushev, D. A. Developing of maps, based on mobile laser scanning data, for security locomotive devices and control systems to manage electric trains traffic // Machine Learning and Data Analysis, 2018, 4(1):36-43. doi:10.21469/22233792.4.1.01 The task of forming common electronic maps for various locomotive safety devices and for controlling the movement of electric trains is extremely urgent; its implementation is designed to improve traffic safety. New possibilities of map formation are provided by the complex system of spatial data of the railway transport infrastructure (CSSD RTI), into which the coordinates of all technogenic objects are entered based on the results of processing mobile laser scanning data obtained in a high-precision coordinate system. The reflection points from all objects, measured with sub-centimeter accuracy, coordinated in three-dimensional space, and annotated with photographs, make it possible to identify all technogenic objects significant for railway traffic safety.
Yakushev, D. A. 3D-modeling of technical condition of railway engineering objects with Bentley Systems software // Machine Learning and Data Analysis, 2018, 4(1):44-51. doi:10.21469/22233792.4.1.02 The lack of a unified measurement system, the low accuracy of design documentation (which establishes requirements only for minimum sizes and inter-positions), and the current system of assessing the state of technogenic infrastructure facilities (which determines only an indicator of grossness unrelated to spatial position) leave not even a theoretical possibility to implement design solutions during construction and to maintain infrastructure in the design position during operation. The technology of informational modeling of technogenic objects of the railway transport infrastructure in three-dimensional coordinate space is intended to change the situation. For example, three-dimensional models of MCC sites created in 2016 made it possible to identify serious discrepancies between the constructed object and the project documentation.
Koltsov, P. P., Osipov, A. S., Sotnezov, R. M., Chekhovich, Yu. V., Yakushev, D. A. Fundamental problems of empirical estimations for computer vision // Machine Learning and Data Analysis, 2018, 4(1):52-68. doi:10.21469/22233792.4.1.03 The paper deals with the comparative study of image processing and analysis algorithms implemented in software- and hardware-based security systems. The main principles of the EDEM methodology, implemented for this purpose, are considered, with a focus on the elements of fuzzy set theory used for the comparative evaluation. In particular, the concepts of fuzzy ground truth images and fuzzy similarity measures are considered. Some examples of application of the EDEM methodology, including the evaluation of algorithms used for solving some rail security tasks, are given.
Vol. 3, №1, 2017
Kulunchakov, A. S. Creation of parametric rules to rewrite algebraic expressions in Symbolic Regression // Machine Learning and Data Analysis, 2017, 3(1):6-19. doi:10.21469/22233792.3.1.01 This paper investigates the problem of bloat in Symbolic Regression (SR). It develops a procedure to simplify superpositions generated by SR. Our approach borrows ideas from equivalent decision simplification and applies them to create parametric rewriting rules. Besides eliminating redundant parts of superpositions, these rules reduce the dimensionality of the parameter space of generated superpositions. The computational experiment is conducted on a dataset related to Brent Crude Oil options, where we approximate the volatility of option prices by their strike prices and expiration dates.
Izmailov P.A., Kropotov D.P. Faster variational inducing input Gaussian process classification // Machine Learning and Data Analysis, 2017, 3(1):20-35. doi:10.21469/22233792.3.1.02 Background: Gaussian processes (GP) provide an elegant and effective approach to learning in kernel machines. This approach leads to a highly interpretable model and allows using the Bayesian framework for model adaptation and incorporation of prior knowledge about the problem. The GP framework is successfully applied to regression, classification, and dimensionality reduction problems. Unfortunately, the standard methods for both GP-regression and GP-classification scale as O(n^3), where n is the size of the dataset, which makes them inapplicable to big data problems. A variety of methods have been proposed to overcome this limitation both for regression and classification problems. The most successful recent methods are based on the concept of inducing inputs. These methods reduce the computational complexity to O(nm^2), where m is the number of inducing inputs, with m typically much less than n. The present authors focus on classification. The current state-of-the-art method for this problem is based on stochastic optimization of an evidence lower bound (ELBO) that depends on O(m^2) parameters. For complex problems, the required number of inducing points m is fairly big, making the optimization in this method challenging. Methods: The structure of the variational lower bound that appears in inducing input GP classification has been analyzed. First, it has been noted that, using quadratic approximations of several terms in this bound, it is possible to obtain analytical expressions for optimal values of most of the optimization parameters, thus significantly reducing the dimension of the optimization space. Then, two methods have been provided for constructing the necessary quadratic approximations: one is based on the Jaakkola-Jordan bound for the logistic function and the other is derived using Taylor expansion. Results: Two new variational lower bounds have been proposed for inducing input GP classification that depend on a smaller number of parameters. Then, several methods have been suggested for optimization of these bounds, and the resulting algorithms have been compared with the state-of-the-art approach based on stochastic optimization. Experiments on a set of classification datasets show that the new methods perform on par with or better than the existing one. Moreover, the new methods do not require any tunable parameters and can work in settings with a big range of n and m values, thus significantly simplifying the training of GP classification models.
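The first of these quadratic approximations is the classical Jaakkola-Jordan bound on the logistic function, quoted here in its standard form (its embedding into the variational bound follows the paper and is not reproduced):

```latex
\[
  \sigma(x) \;\ge\; \sigma(\xi)\,
  \exp\!\Bigl(\tfrac{x - \xi}{2} - \lambda(\xi)\,(x^{2} - \xi^{2})\Bigr),
  \qquad
  \lambda(\xi) = \frac{1}{2\xi}\Bigl(\sigma(\xi) - \tfrac12\Bigr)
               = \frac{\tanh(\xi/2)}{4\xi},
\]
```

which is quadratic in x inside the exponent and therefore makes the expected log-likelihood tractable in the latent function values.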
Mikheeva, A.V., I.I. Kalinnikov. 2017. The GIS-ENDDB algorithms and methods for geoinformation-expert data analysis 3(1):36-49. doi:10.21469/22233792.3.1.03 The software of the geographical information system for studying the Earth's natural disasters (GIS-ENDDB) is focused on research into the cause-and-effect relations of catastrophic events in our planet's history. It contains data on the Earth's seismic activity, anomalies of heat flows (HF), the gravitational field and tomography layers, detailed geographical relief, as well as data on the distribution of cosmogenic structures. To develop methods for analyzing these data, the following updates have been added to its information and mathematical software subsystems: an algorithm for building seismicity lineaments in terms of Great circles (GC) of the Earth; algorithms for constructing the contours of maximum earthquake magnitudes and of averaged earthquake mechanisms; functions for visualizing geophysical fields and cross-sections of different seismicity characteristics; and tomography data. All these updates help to extend the capabilities of classical methods for geotectonic studies by a complex scientific-experimental approach allowing one to reveal tectonically active faults, to study the spatial relationship of seismicity and cosmogenic paleostructures (related to the historical past of the Earth), and, eventually, to interpret the data in terms of constructing seismic-geodynamic models of the lithosphere.
Dvoenko, S.D., D.O. Pshenichny. 2017. The conditionality of matrices of pairwise comparisons after metric corrections 3(1):50-60. doi:10.21469/22233792.3.1.04 In modern intelligent data analysis and data mining, the results of investigations are usually represented by mutual pairwise comparisons of the similarity or dissimilarity of objects. The results of pairwise comparisons need to be immersed into some metric space for the correct use of machine learning algorithms. One of the conditions of a correct immersion is a nonnegative definite matrix of pairwise similarities of the set elements. In this case, nonnegative similarities represent scalar products of vectors in the positive quadrant of an imaginary feature space, and the corresponding dissimilarities represent distances. Various similarity and dissimilarity measurements are used in practice; nevertheless, not all of them are correct as metric functions. Therefore, metric corrections of real experimental matrices of pairwise comparisons are needed to reach the positive definiteness of the corresponding matrices of standard scalar products. Unfortunately, the natural aim of minimizing the deviations of corrected values from the initial ones leads to ill-conditioned matrices of scalar products with a large condition number. A way to improve the conditionality of matrices of pairwise comparisons is investigated.
Karkishchenko, A.N., V.B. Mnukhin. 2017. Gaussian rotations for graphic information protection 3(1):61-75. doi:10.21469/22233792.3.1.05 Digital images over “finite complex planes” are considered jointly with transformations of Gaussian rotations. It is proved that, under some special conditions, the results of such transformations seem to be formed by several zoomed-out copies of the rotated original, though all such “copies” are formed by different pixels of the original image. Based on Gaussian rotations, some methods for tamper-resistant protection of graphic information are considered. A method for verification of protected information is also introduced.
Ganebnykh, S.N., M.M. Lange. 2017. On efficiency of fusion schemes for pattern recognition in ensemble of images 3(1):76-89. doi:10.21469/22233792.3.1.06 In an ensemble of image sources of different modalities, some metric multiclass classifiers are studied. The classifiers make collective decisions for composite objects that are produced by collections of images, one from each source. The discriminant functions of the multiclass classifiers are produced by binary “class-vs-all” NN or SVM classifiers. Two original fusion schemes that use discriminant functions based on different compositions are suggested. The first scheme uses compositions of the dissimilarity measures between the images within each source (General Measure, GM), whereas the second scheme uses compositions of the soft decisions for the images in the submitted composite object (General Similarity, GS). In terms of error rates, the proposed GM and GS fusion schemes are compared with the known MV (Majority Vote) scheme, which is based on majority voting over the compositions of the hard decisions for the individual source images. The comparative efficiency of the above fusion schemes is supported by the error rates for recognition of RGB face images given by the ensemble of their three decorrelated components. For the NN and SVM classifiers, the experimental estimates of the error rates show an advantage of the GM and GS schemes in comparison with the MV scheme.
Vol. 3, №2, 2017
Chukanov, S.N., S.V. Leykhter. 2017. Learning on affine groups for tracking images of objects 3(2):96-106. doi:10.21469/22233792.3.2.01 Algorithms for tracking objects and recognizing the behavior of objects, based on the control of spatial and temporal changes of parameters using learning methods, are considered in the paper. Tracking algorithms in which Lie groups are used for affine transformations are proposed. The parameters of the object’s motion, determined by means of the exponential mapping between the Lie group and its algebra, are analyzed. The parameters are optimized on the manifold. Algorithms for joint learning and estimation with the help of the Luenberger observer for tracking problems on manifolds are presented.
Genrikhov, I.E., E.V. Djukova, and V.I. Zhuravlyov. 2017. Construction and investigation of full regression trees in regression restoration problem in the case of real-valued information 3(2):107-118. doi:10.21469/22233792.3.2.02 Background. The regression restoration problem with real-valued data is considered. This type of information is the most frequently encountered in practice (for example, in problems of medical diagnosis or banking scoring). Among the existing approaches, the one based on the construction of regression trees is highlighted. The best-known algorithms for the synthesis of regression trees (for example, CART (classification and regression tree) and Random Forest) are based on the use of elementary trees, namely, binary regression trees. As a rule, in the synthesis of such trees, the current values of a feature are split by only one threshold at each step of constructing an inner node of the tree. Sometimes, a splitting into several thresholds is performed (interval splitting), but searching for optimal splitting intervals is a computationally complex task. During the synthesis of such trees, only one feature and the corresponding set of thresholds satisfying the selected branching criterion are selected at each step, and branching is performed on them. However, if several different pairs (a feature, a set of thresholds) satisfy the branching criterion in equal or almost equal measure when building a tree, then only one of them is selected (in fact, randomly). Thus, depending on the selected feature and set of thresholds, the constructed trees can differ significantly, both in the composition of the features used and in their recognizing qualities. Methods. An approach to the construction of regression trees based on the so-called full decision tree is applied. In addition, various ways of selecting a set of thresholds for a feature at each step of tree synthesis have been investigated. Previously, this approach was investigated only on the regression restoration problem with integer data and showed an improvement in the quality of the solution in comparison with the known methods of regression tree synthesis. In a full regression tree, a so-called full node is constructed at each iteration for a problem with real-valued features. A set of feature-threshold pairs corresponds to it, in which each pair satisfies the selected branching criterion. Further, a simple inner node is constructed for each pair from this set, from which branching is performed. Compared to the classical construction with the standard method of selecting only one threshold per feature, the full regression tree allows fuller use of the available information, while the description of the recognized object can be generated not only by one branch, as in a classical tree, but by several branches. Results. Two regression tree synthesis algorithms, DFRTree (defined full regression tree) and RFRTree (random full regression tree), are developed. The RFRTree algorithm constructs a full regression tree in which the best set of thresholds for a feature is selected at each step of the synthesis by a statistical criterion estimating various random partitions. The DFRTree algorithm also constructs a full regression tree, but selects the set of thresholds for a feature for which approximately the same number of training objects falls into the resulting intervals.
It is shown that the best results were obtained by the RFRTree algorithm. A comparison of the RFRTree and DFRTree algorithms with known regression tree synthesis algorithms, such as Random Forest, Decision Stump, REPTree, and CART, is carried out on 15 real problems. It is shown that the quality of the RFRTree algorithm is higher than that of the Decision Stump, REPTree, and CART algorithms, and it is as good as the Random Forest algorithm, in some cases showing the best results. Concluding Remarks. It is shown that the developed algorithms for the synthesis of full regression trees for solving the regression restoration problem with real-valued data are not inferior to the known algorithms for the synthesis of regression trees and can be successfully applied on a par with other modern approaches to constructing regression trees.
Kniaz, V.V., O.V. Vygolov, V.V. Fedorenko, and V.S. Sevrykov. 2017. Deep convolutional autoencoders: stereo matching for 3D model reconstruction of low-textured objects 3(2):119-134. doi:10.21469/22233792.3.2.03 Methods: A new method for stereo matching based on deep convolutional autoencoders is presented. An autoencoder reduces the image dimensions and produces a code that can be used to perform an effective search for the corresponding image patch of a low-textured object. Results: The architecture of a new autoencoder was developed. The autoencoder performs coding and decoding of color images with resolution 32 × 32 pixels. A comparison of the performance of the developed method and modern image patch descriptors is presented. The method was applied to process images and to reconstruct three-dimensional (3D) models of archaeological excavations organized by the Bosphorus expedition of the Russian State Historical Museum. Concluding Remarks: The analysis of an application of the developed method shows that it outperforms the existing image descriptors in matching image patches of low-textured objects.
Murashov, D.M., F.D. Murashov. 2017. Method for localizing informative regions with texture of a special type 3(2):135-150. doi:10.21469/22233792.3.2.04 The paper deals with the problem of localizing informative regions with a specific texture in digital images. This type of texture is characterized by uniformly oriented elongated elements and varying spatial frequency. Such a structure, in particular, can be generated by groups of brushstrokes in images of paintings. Existing techniques, for example, those based on Haralick’s features, Laws’ energy features, and Gabor filters, cannot completely solve the problem with the required quality. In this paper, the task of localizing informative areas is addressed as a problem of segmentation of texture images. A method for solving the problem based on a modified superpixel segmentation algorithm with a postprocessing procedure is proposed. The vector of image pixel descriptions is expanded by texture features computed using components of the structure tensor. The selected features capture the peculiarities of the considered texture type. Application of the superpixel algorithm with an extended feature description of images makes it possible to take into account spatial, color, and textural properties of image regions. To obtain an acceptable quality of segmentation, the condition of minimum information redundancy measure is used. A computational experiment has been carried out on textural test images. The results of segmentation of a texture mosaic image by the proposed method have been compared with the well-known method based on Laws’ energy features; the comparison demonstrates the advantage of the proposed method. The developed technique has been used to localize informative areas in images of paintings. The results of the experiment show the efficiency of the proposed method.
Nedel'ko, V.M. 2017. Estimation of feature importance for quantile regression 3(2):151-159. doi:10.21469/22233792.3.2.05 There are a large number of approaches to estimating the significance of variables in problems of constructing decision functions. One of the most important is based on the ROC (relative operating characteristic) curve (error curve). Initially, ROC curves were introduced for classification models. The extension of ROC curves to regression problems has also been investigated; notable examples are the so-called regression error characteristic (REC) curves and the regression ROC (RROC) curves. However, these generalizations require the explicit specification of the predicted values of the target variable, while for constructing the ROC curve in the classification problem one needs only the ordering of objects. There are also some other differences in essential properties between such regression ROC curves and classification ROC curves. The present author proposes some natural generalizations of the concept of the ROC curve for regression analysis, which reproduce the properties of the ROC curve more fully than the known extensions. The most important of these properties is that the ROC curve becomes a straight line when built on a random prediction; the deviations from this line allow one to estimate the importance of a variable. The proposed variants of the ROC curve for regression were found to be close to the construction of the empirical bridge.
Gasanov, E.E., A.P. Motrenko. 2017. Creation of approximating scalogram description in a problem of movement prediction 3(2):160-169. doi:10.21469/22233792.3.2.06 The paper addresses the problem of thumb movement prediction using electrocorticographic (ECoG) activity. The task is to predict thumb positions from the voltage time series of cortical activity. Scalograms, generated by the spatio-spectro-temporal integration of voltage time series across multiple cortical areas, are used as input features of this regression problem. To reduce the dimension of the feature space, local approximation is used: every scalogram is approximated by a parametric model. The predictions are obtained with partial least squares regression applied to the local approximation parameters. Local approximation of scalograms does not significantly lower the quality of prediction, while it efficiently reduces the dimension of the feature space.
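A minimal sketch of the pipeline shape (illustrative; the paper approximates each scalogram by a parametric local model before regression, whereas this sketch flattens the scalogram directly, and the signals and targets are synthetic stand-ins for ECoG and thumb position):

```python
import numpy as np
import pywt  # PyWavelets
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)
n_trials, n_samples = 30, 256
signals = rng.normal(size=(n_trials, n_samples))      # stand-in "ECoG" voltage
targets = signals[:, :50].std(axis=1, keepdims=True)  # stand-in "thumb position"

scales = np.arange(1, 17)
scalograms = np.array([np.abs(pywt.cwt(s, scales, "morl")[0]) for s in signals])
X = scalograms.reshape(n_trials, -1)                  # flattened scalogram features

pls = PLSRegression(n_components=3).fit(X, targets)   # partial least squares
print("train R^2:", pls.score(X, targets))
```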
Vol. 3, №3, 2017
Fedotov, N.G., A.A. Syemov, A.V. Moiseev. 2017. Performance investigation of 3D image recognition by stochastic geometry methods in dependence on the number of reference points on the sphere 3(3):176-192. doi:10.21469/22233792.3.3.01 Background: A newly developed approach to three-dimensional (3D) image recognition is proposed, giving an object description invariant to any of its spatial orientations. This method has many advantages and 3D image data mining capabilities. In particular, in parallel with spatial object recognition, it is possible to analyze the original image. Owing to a rigorous mathematical model, it is possible to analytically design features with predetermined properties. Methods: The suggested approach is based on modern methods of stochastic geometry and functional analysis. The hypertrace transform creates a 3D trace-image of the original spatial object by scanning a grid of parallel planes from different view angles. The hypertrace matrix created on the basis of this trace-image is a convenient tool for analyzing 3D images, in contrast to other mathematical methods. Results: A stochastic scan with random parameters is more efficient than a deterministic scan in terms of the reliability-performance trade-off of 3D image recognition. The results of the conducted experiments are shown; they demonstrate both the theoretical and practical significance and the effectiveness of the proposed method. Concluding Remarks: The performance of 3D image recognition in dependence on the number of reference points on the sphere, with various kinds of scanning, is analyzed. Potential further ways to accelerate the recognition system are proposed.
Starozhilets V.M., U.V. Chehovich. 2017. About identification of a statistical model of traffic flows using vehicle groups 3(3):193-202. doi:10.21469/22233792.3.3.02 A statistical model of traffic flows for modeling the speed and number of cars on highways, identified on data from heterogeneous sources, is proposed. The model simulates the movement of car groups along the highway, using the fundamental diagram corresponding to the selected road segment to calculate the car group speed. Computational experiments are provided to confirm the adequacy of the model; its behavior in the situation where one of the highway lanes is blocked is also analyzed. The criterion of quality is the root-mean-square error between the predicted and the actual number of passed vehicles. Data from traffic detectors, provided by the Traffic Management Center, and data obtained by video recording are used in this study.
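A minimal sketch of moving a car group with a fundamental diagram (illustrative; the paper identifies segment-specific diagrams from heterogeneous data, while here the classical Greenshields relation v = v_f (1 - rho / rho_max) with assumed parameters is used):

```python
V_FREE = 25.0         # free-flow speed, m/s (assumed)
RHO_MAX = 0.12        # jam density, vehicles/m (assumed)
SEGMENT_LEN = 1000.0  # road segment length, m

def group_speed(n_cars):
    """Speed of a car group from the density it creates on the segment."""
    rho = min(n_cars / SEGMENT_LEN, RHO_MAX)
    return V_FREE * (1.0 - rho / RHO_MAX)

pos, n_cars, dt = 0.0, 60, 1.0
for step in range(120):               # simulate two minutes of movement
    pos += group_speed(n_cars) * dt
print(f"group of {n_cars} cars advanced {pos:.0f} m")
```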
Samsonov N.A., A.N. Gneushev. 2017. Textural descriptor in the Hough accumulator space of the gradient field for detecting pedestrians 3(3):203-215. doi:10.21469/22233792.3.3.03 The problem of selecting features for recognizing pedestrians in an image is considered. The most popular and effective approach to feature selection is using descriptors based on Histograms of Oriented Gradients (HOG). In this paper, it is proposed to use the Hough accumulator space to generalize the HOG descriptor by obtaining projections not only of the orientations but also of the positions of the boundaries in a local area of the image, yielding Hough Accumulator Histograms (HAH). The Hough accumulator space is built on the basis of the beam Radon transform of the gradient field of the image. The proposed methods were tested together with a linear support vector machine (SVM) classifier on the INRIA pedestrian database. The results of the experiment have shown better separating ability of the new descriptors and a reduction of false detections in comparison with HOG.
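A minimal sketch of the HOG-plus-linear-SVM baseline that the proposed HAH descriptor generalizes (illustrative; the Hough-accumulator descriptor itself is not implemented here, and the windows and labels are synthetic):

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

rng = np.random.default_rng(6)
windows = rng.random((40, 128, 64))          # 128x64 detection windows
labels = rng.integers(0, 2, size=40)         # stand-in pedestrian / background

X = np.array([hog(w, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2)) for w in windows])
clf = LinearSVC(max_iter=5000).fit(X, labels)
print("train accuracy:", clf.score(X, labels))
```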
Vol. 3, №4, 2017
Sarmanova O.E., S.A. Burikov, S.A. Dolenko, I.V. Isaev, V.A. Svetlov, K.A. Laptinskiy, T.A. Dolenko. 2017. Estimation of the perspective of using machine learning methods for the purpose of monitoring of the excretion of theranostic fluorescent nanocomposites out of the organism 3(4):222-238. doi:10.21469/22233792.3.4.01 Background: At present, the development of new nanomaterials that can be used for diagnostics and medical treatment simultaneously is utterly relevant in biomedicine. While using such agents, one has to control their excretion out of the body. Methods: The results of estimating the prospects of applying machine learning methods for monitoring the excreted theranostic nanocomposites (carbon dots covered by copolymer and folic acid) and their components by their fluorescence spectra in urine are presented. The problem was solved as a clustering problem (by k-means and by the algorithm of adaptive construction of hierarchical neural classifiers developed by the authors) and as a classification problem (by neural networks). None of the clusterings revealed sensitivity to the types of nanoparticles contained in the suspension. Results: The best results in the classification problem were provided by a perceptron with 8 neurons in a single hidden layer, trained on the set of significant input features selected by cross-correlation. Recognition accuracy averaged over all five classes was 72.3%.
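A minimal sketch of the best-performing configuration mentioned above (illustrative; the spectra, five-class labels, and feature count below are synthetic stand-ins for the selected fluorescence features):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 30))        # stand-in selected spectral features
y = rng.integers(0, 5, size=200)      # five classes of nanocomposite content

clf = MLPClassifier(hidden_layer_sizes=(8,),  # 8 neurons, single hidden layer
                    max_iter=2000, random_state=0)
clf.fit(X, y)
print("train accuracy:", clf.score(X, y))
```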
Djukova, E.V., G.O. Maslyakov, P.A. Prokofjev. 2017. About product over partially ordered sets 3(4):239-249. doi:10.21469/22233792.3.4.02 One of the central intractable problems of logical data analysis, the dualization over the product of partial orders, is considered. An important special case is investigated, when each order is a chain. The relevance of this case is determined by the large number of applications, among which, first of all, one should single out such areas as machine learning and the search for association rules in databases. If the number of elements in each chain is two, then the dualization over the product of chains reduces in a well-known way to the construction of irreducible coverings of a Boolean matrix (the dualization of a Boolean matrix). In this paper, it is shown that in the general case, when the number of elements in each chain is greater than two, the posed problem reduces to finding a subset of the set of irreducible coverings of a special Boolean matrix whose size grows with the number of elements in the chain (the number of columns of the matrix grows linearly). The results of numerical experiments based on the asymptotically optimal search for irreducible coverings of a Boolean matrix, effective "in the typical case" (for almost all variants of the problem), are presented. The Boolean matrix dualization algorithm Runc-M, constructed earlier by E. V. Djukova and P. A. Prokofjev and currently the world leader in speed, is modified for the experiments. Previously, to solve the problem of dualization over the product of chains, an approach was proposed that is mainly of theoretical interest and is aimed at constructing incremental algorithms with quasi-polynomial time estimates "in the worst case" (E. Boros, K. Elbassioni, V. Gurvich, L. Khachiyan, and K. Makino, 2002).
Karyakina, A.A., A.V. Melnikov. 2017. Comparison of methods for predicting the customer churn in Internet service provider companies 3(4):250-256. doi:10.21469/22233792.3.4.03 The possibility of forecasting customer churn based on data of Russian Internet service providers (ISPs) has been considered. The basic approaches to preprocessing the archived data are defined. For comparison, the following classification algorithms are used for prediction: decision trees, random forest, the naive Bayes algorithm, gradient boosting, and the method of k-nearest neighbors. As the first sample, an experimental array of input data of size 6 × 400 000 was formed, which contains the fields from the calls (id, type of service, feature, reason, type of result, and leaving). As the second sample, an array of input data of size 13 × 400 000 was formed, with the following selected features: id, count of calls for each type of service and for each type of result, total count of calls from the client, and leaving. The models for prediction with the best parameters have been constructed. The tables show the results of the research with different data sets for the various classifiers.
Chigrinskiy V.V., I.A. Matveev. 2017. Iris structure motion analysis via optical flow method 3(4):257-266. doi:10.21469/22233792.3.4.04 Nonlinear movements of elements of the human iris during pupil size variations are studied. Tracking of iris elements is done with the help of optical flow methods. The aim is to estimate a radially symmetric function that describes the positions of iris structural elements with respect to pupil size. The quality of the method is assessed by applying it to synthetic data built from a preselected deformation model; the obtained function is then matched against the model. To test the algorithm on real data, a video of a human eye's reaction to a flash of light, recorded by a special device, is used.
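A minimal sketch of dense optical-flow tracking between two frames (illustrative; the paper fits a radially symmetric deformation model to such flow fields as a function of pupil size, which is not reproduced, and the frames below are synthetic):

```python
import numpy as np
import cv2  # OpenCV

rng = np.random.default_rng(7)
prev = (rng.random((240, 320)) * 255).astype(np.uint8)
curr = np.roll(prev, shift=2, axis=1)      # fake 2-pixel horizontal motion

flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)
print("mean displacement (dx, dy):",
      flow[..., 0].mean(), flow[..., 1].mean())
```

With real data, consecutive grayscale frames of the eye video would replace the synthetic arrays.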
