### Vol. 1, №3, 2012

Kushnir O.A. *Binary bitmaps shape comparison based on skeletization* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 252-263. The paper addresses the comparison of binary bitmap shapes using skeletal graphs. Existing approaches to skeletal graph comparison are analyzed; they apply various classification methods based on features, measures, and metrics to these graphs. The paper also considers the problem of finding a metric on the space of skeletal graphs that makes it possible to compare the shapes of arbitrary objects effectively in real time using a general classifier based on the support vector method.

Belomestny D., Panov V., Spokoiny V. *Semiparametric estimation of the signal subspace* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 264-271. Let a high-dimensional random vector $X$ be represented as a sum of two components: a signal $S$ that belongs to some low-dimensional linear subspace $\mathcal{S}$, and a noise component $N$. This paper presents a new approach for estimating the subspace $\mathcal{S}$ based on the ideas of Non-Gaussian Component Analysis. The approach avoids the technical difficulties that usually appear in similar methods: it requires neither the estimation of the inverse covariance matrix of $X$ nor the estimation of the covariance matrix of $N$.

Tselykh V.R. *Multivariate adaptive regression splines* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 272-278. The article describes multivariate adaptive regression splines, which are very useful for high-dimensional problems and show great promise for fitting nonlinear multivariate functions. The technique does not assume any particular class of relationship between the predictor variables and the outcome variable of interest. The approximation error is investigated as a function of model complexity. The method is illustrated on test data, ECG data, and data from financial mathematics.

Aduenko A.A. *Feature selection and stepwise logistic regression for credit scoring* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 279-291. The article is dedicated to selecting the optimal set of features for assessing the quality of bank loan applications. To this end, the probability of default is estimated. Stepwise regression is used for feature selection. The dependence of the informativeness of the selected features on the stepwise regression parameters is studied. In the computational experiment, the algorithm described in the paper is tested on data on consumers who applied for loans at a certain bank and on data on clients' responses to a bank's marketing campaign.

Medvednikova M.M. *Principal component analysis for building integral indicators* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 292-304. The main goal of this work is to present principal component analysis as a tool for constructing integral indicators. The results are compared with those of the Pareto slicing method. An integral indicator for Russian universities is built using the biographies of the 30 richest Russian businessmen according to Forbes magazine in 2011.
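
As a rough illustration of the idea, and not the author's implementation, the sketch below scores each object by its projection onto the first principal component of an object-feature matrix. The function names, the use of power iteration, and the toy data are assumptions made for this example.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (column means, unit vector of the first principal component)."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    # Sample covariance matrix of the centered features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    # Power iteration; assumes a nonzero covariance matrix and a start
    # vector that is not orthogonal to the leading component.
    vector = [1.0] * m
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    return means, vector

def integral_indicator(rows):
    """Score every object by its projection onto the first principal component."""
    means, vector = first_principal_component(rows)
    return [sum((row[j] - means[j]) * vector[j] for j in range(len(vector)))
            for row in rows]

# Hypothetical object-feature matrix: three objects, two features.
scores = integral_indicator([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```

The sign of a principal component is arbitrary, so in practice the direction of the indicator would need to be fixed by an additional convention.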

Romanenko A.A. *Feature selection and stepwise logistic regression for credit scoring* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 305-310. The article suggests a method for clustering a text collection based on classical clustering algorithms such as k-means. The authors consider a metric between texts that takes into account the similarity of their vocabularies, and investigate the applicability of this metric to measuring the distance between real texts. The computational experiment compares the clustering results with a given distribution of texts over a set of topics.

Tsyganova S.V. *Methods of local forecasting with transformation accounting* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 311-317. This paper considers an algorithm of local forecasting with transformation, which finds similar intervals of a time series under the introduced metrics. The concept of invariant transformations is considered, along with the choice of the transformation most suitable for the forecasting problem. The work is illustrated on energy consumption data and synthetic data.

Kuzmin A.A. *Multi-level classification upon detection of price movement* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 318-327. This research describes a forecasting method based on the logistic regression model. A method for labeling a time series beam and building an object-feature matrix is proposed. The algorithm is tested on synthetic time series beams in the form of a noisy sine wave and a periodic trapezium. As a practical application, the algorithm is tested on energy consumption data.

Klochkov Y.Y. *Quasiperiodic time series forecast using nonparametric methods* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 328-334. This paper considers a nonparametric method of forecasting quasi-periodic time series. Quantile regression is used as the method. Its advantage is that, despite its simplicity, it is a good approximation of many known distributions. The proposed method is tested on product sales data.

Leonteva L.N. *Feature selection in autoregression forecasting* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 335-346. The authors investigate the optimal model selection problem as applied to autoregressive forecasting. To solve the problem, one has to select a maximal well-defined feature subset subject to a given value of the error function. A modified add-del feature selection algorithm is used to select the feature set. The paper suggests a method for selecting a time series forecasting model. The computational experiment compares forecasts of hourly electricity prices.

Zaytsev A.A., Tokmakova A.A. *Estimation regression model hyperparameters using maximum likelihood* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 347-353. The paper considers the regression model selection problem. The model parameters are assumed to be a multivariate random variable with independently distributed components. A method for hyperparameter optimization is proposed, and a direct way to obtain the hyperparameter estimates is shown. The paper illustrates the use of the hyperparameters in the feature selection problem. The suggested method is compared with the Laplace approximation method.

Motrenko A.P. *Bayesian sample size estimation for logistic regression* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 354-366. The problem of sample size estimation is important in medical applications, especially when measurements of immune biomarkers are expensive. The paper describes the problem of logistic regression analysis, including model feature selection, and covers sample size determination algorithms, namely methods of univariate statistics, logistic regression, cross-validation, and Bayesian inference. Treating the regression model parameters as a multivariate random variable, the authors propose to estimate the sample size using the distance between parameter distribution functions on cross-validated data sets.

Varfolomeeva A.A. *Local forecasting with metrics selection* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 367-375. This article considers a local method of time series prediction based on the k nearest neighbors algorithm. The author investigates the choice of a metric for finding similar parts of the series. The effectiveness of the forecasting algorithm under different metrics is compared on synthetic data and on time series of electricity consumption and sugar prices.
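
A minimal sketch of local forecasting with k nearest neighbors is given below, assuming that the forecast is the mean of the values that followed the k most similar past windows. The function names, the window length, and the two example metrics are illustrative assumptions and are not taken from the paper.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_forecast(series, window, k, distance):
    """Forecast the next value from the k past windows closest to the latest one."""
    if len(series) <= window:
        raise ValueError("series must be longer than the window")
    query = series[-window:]
    candidates = []
    # Every earlier window with a known next value is a candidate neighbor.
    for start in range(len(series) - window):
        segment = series[start:start + window]
        candidates.append((distance(segment, query), series[start + window]))
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    # Average the continuations of the nearest windows.
    return sum(value for _, value in nearest) / len(nearest)

# Hypothetical periodic series; the metric is passed in as a parameter.
series = [1, 2, 3] * 4
forecast = knn_forecast(series, window=3, k=2, distance=euclidean)
```

Passing the metric as an argument makes it straightforward to compare forecasts under different metrics on the same series.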

Budnikov Y.A. *The estimation of probabilities of appearance of word strings in a natural language* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 3. Pp. 376-386. This article considers the estimation of the probabilities of word strings appearing in a natural language. N-gram language models are used to solve this problem. Class-based language models are used to cope with the huge number of parameters. Good-Turing estimates, Katz smoothing, and absolute discounting are used to handle "unseen" words. Basic definitions are introduced, and the methods and the algorithm for constructing classes in class-based language models are described. The work is illustrated by experiments on synthetic data.
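
To make the smoothing idea concrete, here is a small sketch of a bigram model with interpolated absolute discounting, one common variant of the technique. It is not the paper's implementation; the function name, the discount value, and the toy corpus are assumptions.

```python
from collections import Counter

def train_bigram_model(tokens, discount=0.5):
    """Return a function P(word | prev) smoothed by interpolated absolute discounting."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter()  # how often each word occurs as a left context
    followers = Counter()       # number of distinct words seen after each context
    for (prev, _), count in bigrams.items():
        context_totals[prev] += count
        followers[prev] += 1
    total = len(tokens)

    def probability(word, prev):
        unigram = unigrams[word] / total
        if context_totals[prev] == 0:
            return unigram  # unseen context: fall back to the unigram estimate
        # Subtract a fixed discount from every seen bigram count...
        discounted = max(bigrams[(prev, word)] - discount, 0) / context_totals[prev]
        # ...and redistribute the freed mass in proportion to unigram frequencies.
        backoff_weight = discount * followers[prev] / context_totals[prev]
        return discounted + backoff_weight * unigram

    return probability

# Hypothetical toy corpus.
prob = train_bigram_model("a b a c a b".split())
unseen = prob("a", "a")  # the bigram "a a" never occurs, yet gets nonzero probability
```

With a discount between 0 and 1, the conditional probabilities for a seen context still sum to one over the vocabulary.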

### Vol. 1, №4, 2012

Zhukova K.V., Reyer I.A. *Parametric family of skeleton bases of a polygonal figure* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 4. Pp. 391-410. The paper considers a skeleton base, a stable skeletal shape representation constructed using a polygonal figure that approximates the shape. The monotonicity and continuity of the change of a skeleton base as the approximation accuracy value grows are investigated. The concept of a skeleton markup is presented: a set of points of a polygonal figure's skeleton that describes the change of a skeleton base and allows one to build skeleton bases for a given set or range of approximation accuracy values.

Kuznetsov M.P. *Integral indicator construction using copulas* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 4. Pp. 411-419. We construct an integral indicator for the IUCN Red List of Threatened Species. The construction method is based on copulas, which describe the statistical dependence between the features. We propose a two-step algorithm for parameter estimation. In the first step we estimate the parameters of the marginal distributions of the features. In the second step we estimate the copula parameters.

Burmistrov M.O., Sanduleanu L.N. *Probabilistic model for one-class classification problem* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 4. Pp. 420-427. One-class classification methods are used to test e-mails for spam. A quasi-probabilistic model is introduced for the traditional empirical approach to the problem. The old model is shown to be a reduction of the new one. The resulting classification approaches are tested numerically on model and real data.

Motrenko A.P. *Joint probability density estimation* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 4. Pp. 428-436. When solving a classification problem one often has to deal with both discrete and continuous variables. For example, in logistic regression the independent variables are distributed continuously, while the target variable follows a Bernoulli distribution. This paper presents a method for estimating a joint probability distribution that includes both discrete and continuous variables. The case when no probabilistic assumptions can be made is considered. Methods of nonparametric regression are used. A comparison with the classical methods of probability theory is also presented. The experiment is conducted on real and synthetic data.

Tselykh V.R., Vorontsov K.V. *Goodness-of-fit tests for sparse multinomial distributions with application to topic modeling* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 4. Pp. 437-447. Pearson’s goodness-of-fit test is not appropriate for sparse multinomial distributions. In this case the distribution of the statistic is not asymptotically chi-squared and depends on the sample size and on the form of the tested distribution. The article suggests statistical criteria based on the empirical distribution of a statistic obtained by sampling. Their application to text analysis is considered, in particular to testing the conditional independence hypothesis when evaluating probabilistic topic models.
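
The sampling idea can be sketched as a Monte Carlo test: the chi-squared reference distribution is replaced by the empirical distribution of Pearson's statistic over samples drawn from the tested distribution. This is a generic illustration under that assumption, not the criteria proposed in the article; the function names, the number of samples, and the example counts are hypothetical.

```python
import random

def pearson_statistic(counts, probs):
    """Pearson's chi-squared statistic; every probability must be positive."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

def sample_counts(n, probs, rng):
    """Draw n observations from the multinomial distribution given by probs."""
    counts = [0] * len(probs)
    for index in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[index] += 1
    return counts

def monte_carlo_p_value(observed, probs, n_samples=2000, seed=0):
    """Estimate the p-value from the simulated distribution of the statistic."""
    rng = random.Random(seed)
    n = sum(observed)
    observed_stat = pearson_statistic(observed, probs)
    exceed = sum(
        pearson_statistic(sample_counts(n, probs, rng), probs) >= observed_stat
        for _ in range(n_samples)
    )
    # The +1 terms keep the estimate strictly positive.
    return (exceed + 1) / (n_samples + 1)

# Hypothetical example: 20 observations over 4 equally likely outcomes.
p_value = monte_carlo_p_value([20, 0, 0, 0], [0.25, 0.25, 0.25, 0.25], n_samples=200)
```

Because the reference distribution is simulated for the actual sample size and tested distribution, the test does not rely on the chi-squared approximation that breaks down for sparse counts.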

Valkov A.S., Kozhanov E.M., Medvednikova M.M., Husainov F.I. *Non-parametric forecasting of railroad stations occupancy according to historical data* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 4. Pp. 448-465. The authors propose a method for non-parametric forecasting of railroad station occupancy from historical data. The algorithm is based on the convolution of the empirical density of the time series values with the loss function. The features of the autoregressive forecasting model are investigated. The algorithm is illustrated on railroad station occupancy data for the Omsk region in 2007 and 2008.

Zhivotovskiy N.K. *Combined generative and discriminative approach for classification with a small learning set* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 4. Pp. 466-472. This paper deals with two statistical approaches to solving classification problems and with a way of combining them to estimate the parameters of a classifier for samples of different cardinality. The combined discriminative and generative model is built for the case of a multivariate normal distribution of objects within classes. When the size of the learning set is restricted, this model shows a lower classification error probability than a purely generative or purely discriminative model.

Vasileisky A.S., Karatsuba E.A., Karelov A.I., Kuznetsov M.P., Reyer I.A. *The algorithm of persistent scatterers detection on the satellite radar images of the earth surface* // Journal of Machine Learning and Data Analysis. 2012. V. 1, № 4. Pp. 473-484. We consider the problem of detecting persistent scatterers of a radar signal on the earth's surface. To detect the scatterers we use satellite SAR images consisting of amplitude and phase components. The amplitude component is used to identify the scatterers' coordinates, and the phase component is used to determine the scatterers' movement due to terrain shifts. We propose a blob detection algorithm to find the scatterers. The algorithm is illustrated on synthetic and real data. We describe a method for processing the satellite images, a method for constructing and verifying the synthetic data, and a method for detecting the system of persistent scatterers.