# Research

- Types of Learning
- Theories and Algorithms of Machine Learning
- Model Selection
- Active Learning
- Incremental Learning / Online Learning
- Dimensionality Reduction
- Learning from Similarity Data / Kernel Methods
- Learning under Different Distributions
- Importance Estimation / Density Ratio Estimation
- Mutual Information Estimation / Divergence Estimation
- Outlier Detection
- Independent Component Analysis
- Causal Inference
- Conditional Probability Estimation
- Risk-sensitive Learning
- Low-rank Approximation

- Applications of Machine Learning

## Types of Learning

### Supervised Learning

Supervised learning is the problem of learning an underlying input-output relation from input-output samples. Once the underlying relation has been successfully learned, answers to unlearned questions can be predicted; in other words, the learning machine can generalize to unexperienced situations. The goal of supervised learning research is to let the learning machine acquire the best generalization performance from the smallest possible number of training samples. The supervised learning problem can be formulated as a function approximation problem from samples.

### Unsupervised Learning

Unsupervised learning is literally learning without supervisors; only input data, without output data (answers), are given. The goal of unsupervised learning depends on the situation, and the unsupervised learning problem is sometimes not mathematically well-defined. Data clustering, which aims at grouping similar data, is a typical example. In data clustering, how to measure the affinity between data samples needs to be predetermined, but there is no objective criterion that can quantitatively evaluate the validity of the affinity measure; it is often determined merely subjectively.

### Semi-supervised Learning

Input-output samples are given in supervised learning, while input-only samples are given in unsupervised learning. Semi-supervised learning lies in the middle of these two situations---in addition to the input-output samples, input-only samples are also provided. The goal of semi-supervised learning is the same as that of supervised learning, i.e., to let the learning machine acquire the best generalization performance.

In semi-supervised learning, the number of input-output samples is typically small, while the number of input-only samples is very large. In such situations, the generalization performance may be further improved by utilizing the large number of input-only samples in addition to the small number of input-output samples.

### Reinforcement Learning

Reinforcement learning is the problem of learning the policy function (a mapping from a state to an action) of an agent. The policy function is an input-output relation, so the goal of reinforcement learning is the same as that of supervised learning. However, unlike in supervised learning, the output data cannot be directly observed. Therefore, the policy function needs to be learned without supervisors; but unlike in unsupervised learning, a reward is given for an agent's action. The goal of reinforcement learning is to learn, based on the reward information, the policy function such that the sum of rewards the agent will receive in the future is maximized.

A typical approach to reinforcement learning does not directly learn the policy function itself; it learns the value function, a mapping from a state (and an action) to the sum of rewards. The desired policy function can then be generated based on the value function. (book in Japanese: Training Robotic Game Players by Reinforcement Learning), (book: coming soon).
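As a minimal illustration of the value-function approach, here is a toy value-iteration sketch (the 3-state chain MDP, discount factor, and rewards are all hypothetical and not taken from the books cited):

```python
# Toy value iteration on a 3-state chain: actions are 0 = stay, 1 = move
# right; being in the rightmost state yields reward 1 on every step.
n_states, gamma = 3, 0.9

def next_state(s, a):
    return min(s + a, n_states - 1)

def reward(s, a):
    return 1.0 if next_state(s, a) == n_states - 1 else 0.0

V = [0.0] * n_states
for _ in range(200):  # iterate the Bellman optimality update until convergence
    V = [max(reward(s, a) + gamma * V[next_state(s, a)] for a in (0, 1))
         for s in range(n_states)]

# The greedy policy w.r.t. the learned value function recovers "move right"
# in the non-goal states (ties are broken toward action 0).
policy = [max((0, 1), key=lambda a: reward(s, a) + gamma * V[next_state(s, a)])
          for s in range(n_states)]
print(policy)  # [1, 1, 0]
```

The policy is read off the value function rather than learned directly, which is exactly the indirection described above.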

## Theories and Algorithms of Machine Learning

### Model Selection

In order to acquire a high level of generalization capability in supervised learning, it is important to determine the complexity of the learning machine (model) appropriately. If the model is too simple, it is not flexible enough to represent the learning target function, and therefore the generalization performance cannot be improved even if a large amount of training data is employed for learning. On the other hand, if the model is too complex, it is capable of representing the target function, but such a complex model is heavily affected by the noise in the training data. Therefore, in practical situations with a rather small number of training samples, overly complex models do not give good generalization performance.

Model selection is one of the most fundamental and central research topics in supervised learning. Model selection is generally carried out by finding the model such that the generalization performance is maximized. Therefore, the key to model selection research is how to derive an accurate estimator of the generalization performance.
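The complexity trade-off above can be sketched with a toy example (hypothetical data; a held-out validation error serves as a crude stand-in for a generalization-error estimator, not as any of the criteria discussed below):

```python
import numpy as np

# Hypothetical data: noisy samples of sin(pi x); model complexity is the
# polynomial degree, and the held-out error estimates generalization.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=60)
x_tr, y_tr, x_va, y_va = x[:40], y[:40], x[40:], y[40:]

def val_error(deg):
    coef = np.polyfit(x_tr, y_tr, deg)   # fit a model of the given complexity
    return np.mean((np.polyval(coef, x_va) - y_va) ** 2)

best = min(range(1, 11), key=val_error)
# A too-simple model (degree 1) underfits; overly complex degrees chase noise.
print(best, val_error(1), val_error(best))
```

Model selection then amounts to picking the complexity that minimizes the estimated generalization error.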

#### Unbiased Model Selection Criteria

We tackled this model selection problem by a functional analytic approach and developed an estimator of the generalization error called the subspace information criterion (SIC) (Neural Computation, 2001). SIC has been shown to work well because of its data-dependence; thus SIC tends to outperform existing methods in single trials (Machine Learning, 2002). SIC was originally proposed as a criterion for selecting subspace models. Later, we extended SIC so that it can choose the regularization parameter in linear ridge regression (Neural Networks, 2002), in sparse linear regression (IEEE Transactions on Neural Networks, 2002), and in kernel ridge regression (Journal of Machine Learning Research, 2002). Furthermore, we theoretically evaluated the performance of SIC when the training input points also contain noise (Neural Information Processing - Letters and Reviews, 2004), and extended the range of application of SIC so that non-linear parameter learning methods are also allowed (IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2007). For the above research works, the Young Researcher Award was given by the Japanese Neural Network Society (JNNS) in 2001, the Information Technology Award by the Funai Foundation for Information Technology in 2003, and the Nakamura Research Award by the Tejima Foundation for Industry and Education in 2003.

#### Regularized Model Selection Criteria

Including SIC, the performance of model selection criteria is usually evaluated by their unbiasedness. However, unbiasedness does not necessarily imply good model selection performance; it is important to also take the variance of the model selection criteria into account. Based on this idea, we proposed the corrected SIC (cSIC), which is more accurate than the original SIC in the mean squared error sense (IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2003). Furthermore, we developed the regularized SIC (RSIC), which can more significantly reduce the variance of SIC by slightly sacrificing its unbiasedness (Neural Computation, 2004). RSIC remarkably improves the stability and reliability of model selection. We also proposed methods of analytically obtaining the best model under RSIC that improve the model selection accuracy and computational efficiency at the same time; for shrinkage learning (IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2006) and for adaptive ridge learning (IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2007). RSIC includes a regularization parameter in the model selection criterion, which is optimized based on a meta model selection criterion. We improved the meta criterion and showed that the model selection performance is further improved (IEICE Transactions on Information and Systems, 2007).

### Active Learning

If the training input location (question) is designed appropriately in supervised learning, one can obtain better generalization capability with a small amount of training data. The problem of designing the training input location is called active learning (or experimental design). The challenge of active learning is to predict the generalization performance of a learning machine before the training data is gathered. Therefore, active learning is thought to be a more difficult problem than model selection; in particular, how to cope with the bias of the estimator, which generally cannot be predicted without access to training data, is the key issue in active learning.

#### Active Learning for Individual Models

Standard active learning methods design the input location so that the variance of a learning machine is minimized under the assumption that the model at hand is correctly specified (i.e., the learning target function is included in the model). We showed that this standard approach actually consists of two stages: the first stage corresponds to implicitly reducing the bias, while the second stage minimizes the variance. However, the existing approach only explicitly solves the second problem---we found that the variance actually increases in the first stage. To cope with this problem, we proposed a two-stage incremental active learning method (Neural Computation, 2000). This method improves the performance by explicitly suppressing the increase of the variance in the first stage while guaranteeing that the bias vanishes. The thesis that contains this work was selected for the Doctor Dissertation Award by the Tejima Foundation for Industry and Education in 2002.

The above two-stage method is practically useful. However, it is an incremental method, i.e., the input locations are determined one by one in a sequential manner. Therefore, it can only achieve greedy optimality. Theoretically, batch active learning---designing all input locations at the same time---is globally optimal. Following this fact, we developed a globally optimal batch active learning method for the case where the learning target function is included in trigonometric polynomial models, and elucidated its theoretical properties and geometric structure (IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2001). For this achievement, we received the Young Engineer Award from The Institute of Electronics, Information and Communication Engineers (IEICE) in 2002.

In the above methods, the model at hand is assumed to be correct. However, this assumption may not be satisfied in practice. Therefore, developing an active learning method applicable to misspecified models (i.e., the learning target function is not included in the model) is practically very important---but this is known to be a hard problem since it is difficult to control the bias of misspecified models. For misspecified models, importance-weighted least-squares learning is proved to give an asymptotically unbiased estimator. Based on this fact, we proposed a batch active learning method called ALICE (Journal of Machine Learning Research, 2006). ALICE is guaranteed to give a minimum-variance design of input locations under the constraint that the bias is asymptotically minimized. ALICE has been applied to the efficient design of sampling policies in robot control.
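The importance weighting that underlies ALICE can be sketched with a hedged toy example (the densities, model, and target below are hypothetical; this is plain importance-weighted least squares, not the ALICE design criterion itself):

```python
import numpy as np

# Training inputs follow p_tr = N(0, 1); test inputs follow p_te = N(1, 0.25^2).
# The model (a straight line) is misspecified for the target f(x) = x^2.
rng = np.random.default_rng(2)
n = 500
x_tr = rng.normal(0.0, 1.0, n)
y_tr = x_tr ** 2 + 0.1 * rng.normal(size=n)

p_tr = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
p_te = lambda x: np.exp(-(x - 1) ** 2 / (2 * 0.25 ** 2)) / np.sqrt(2 * np.pi * 0.25 ** 2)
w = p_te(x_tr) / p_tr(x_tr)                       # importance weights

X = np.stack([np.ones_like(x_tr), x_tr], axis=1)
# Weighted normal equations: argmin_theta sum_i w_i (y_i - theta . phi(x_i))^2
theta_iw = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y_tr))
theta_ols = np.linalg.lstsq(X, y_tr, rcond=None)[0]  # unweighted, for contrast

# Around the test region (x near 1) the tangent of x^2 has slope about 2;
# the importance-weighted fit recovers it, the unweighted fit does not.
print(theta_iw[1], theta_ols[1])
```

The weighting redirects the misspecified model toward the region where predictions will actually be evaluated, which is why the bias can be controlled.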

ALICE is an effective active learning method when we are allowed to set the input points at any desired location. In some practical situations, however, we are only given a finite number of input points and have to choose the best input points among the candidates. Such a scenario is called pool-based active learning. We extended ALICE to pool-based ALICE (P-ALICE) and showed its usefulness through extensive simulations (Machine Learning, 2009). P-ALICE has been applied to semiconductor wafer alignment in semiconductor exposure apparatus.

#### Active Learning for Multiple Models

Active learning and model selection have so far been investigated separately, although they share the common goal of improving the generalization performance. If the model and the training input location are optimized simultaneously, the generalization capability could be further improved. However, simply combining existing active learning methods and existing model selection methods does not allow carrying out active learning and model selection at the same time: the model must already be fixed in existing active learning methods, while the training input location must already be determined and the corresponding training output values observed in existing model selection methods. We call this phenomenon the model selection/active learning dilemma.

For this challenging problem, we proved that there exists a common optimal training input location for trigonometric polynomial models. Based on this result, we developed an algorithm for active learning with model selection for trigonometric polynomial models that successfully avoids the model selection/active learning dilemma (IEICE Transactions on Information and Systems, 2003).

Furthermore, we extended the above method to accommodate a general class of models and developed the method of ensemble active learning (Neural Networks, 2009).

### Incremental Learning / Online Learning

When training samples are given sequentially in supervised learning, it is computationally efficient to build the learning result upon the previously learned results. Learning in such an incremental manner is called incremental learning (or online learning).

We developed an incremental learning method called incremental projection learning (IPL), which relies only on incremental computation but provides exactly the same learning result as that obtained by batch learning with all training samples (Neural Networks, 2001a). We also elucidated theoretical properties of IPL (Neural Networks, 2001b). IPL forms a basis for developing incremental active learning algorithms. Furthermore, we proposed an IPL-based method of constructing neural networks (IEICE Transactions on Information and Systems, 2002). For the above series of works on incremental learning, the Inose Science Promotion Award was given by the Foundation of Electrical, Electronics, and Information Science Promotion in 1998.
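The flavor of "incremental computation, batch-equivalent result" can be sketched with plain least squares (a generic sufficient-statistics update with hypothetical data, not IPL itself):

```python
import numpy as np

# Sufficient statistics (A, b) of the normal equations are updated sample
# by sample; the incremental estimate matches batch least squares.
rng = np.random.default_rng(3)
true_theta = np.array([1.0, -2.0])
xs = rng.normal(size=(50, 2))
ys = xs @ true_theta + 0.01 * rng.normal(size=50)

A = np.eye(2) * 1e-8      # tiny ridge so the system is solvable from the start
b = np.zeros(2)
for x, y in zip(xs, ys):  # one sample at a time
    A += np.outer(x, x)
    b += y * x
theta_inc = np.linalg.solve(A, b)

theta_batch = np.linalg.lstsq(xs, ys, rcond=None)[0]
print(np.allclose(theta_inc, theta_batch, atol=1e-4))  # True
```

Each update costs a constant amount of work per sample, yet nothing is lost relative to retraining from scratch.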

Online learning is also important in the context of density ratio estimation. We developed an online version of the density ratio estimation algorithm KLIEP, and showed its effectiveness through change-point detection experiments (SDM2009).

### Dimensionality Reduction / Feature Extraction / Feature Selection

When the dimensionality of the input data is high, it becomes extremely difficult to learn well from the data (a phenomenon known as the 'curse of dimensionality'). This difficulty can be alleviated if the dimensionality of the data is reduced. The goal of dimensionality reduction is to reduce the dimensionality of the data without losing the intrinsic information contained in the data. If the low-dimensional components are chosen as a subset of the original features, the dimensionality reduction task is called feature selection. On the other hand, if the low-dimensional components are chosen as (possibly non-linear) combinations of the original features, it is referred to as feature extraction. Feature selection is preferred in scientific applications such as biology and chemistry, where interpretability of the learned results is more important than predictability. On the other hand, in general engineering applications such as pattern recognition, predictability is important, so feature extraction is more useful.

#### Supervised Dimensionality Reduction

The goal of supervised dimensionality reduction is to find a subset of input variables or a subspace of the input space that explains the output values the best.

##### Discriminant Analysis

A typical method of supervised dimensionality reduction in classification scenarios is Fisher discriminant analysis (FDA). When the samples in each category follow a normal distribution with a common covariance structure, FDA provides the optimal solution. However, when the samples in some category form separate clusters (i.e., are multi-modal), FDA does not give desirable solutions. To cope with this problem, we proposed local Fisher discriminant analysis (LFDA), which works well with multi-modal data (Journal of Machine Learning Research, 2007), (software). For this work, the Incentive Award was given by the Japanese Society for Artificial Intelligence (JSAI) in 2007.

##### Sufficient Dimension Reduction

Sufficient dimension reduction is an approach to supervised dimensionality reduction for both regression and classification, aimed at finding an input subspace such that, given the projected inputs, the outputs are conditionally independent of the remaining input components. Based on a method of density-ratio estimation, we developed a sufficient dimension reduction method (AISTATS2010).

#### Semi-supervised Dimensionality Reduction

In semi-supervised learning settings, where input-only samples are given in addition to input-output samples, we expect to obtain better dimensionality reduction performance by utilizing the input-only samples.

For semi-supervised classification scenarios, we developed semi-supervised local Fisher discriminant analysis (SELF) by combining principal component analysis (unsupervised dimensionality reduction) and local Fisher discriminant analysis (supervised dimensionality reduction) (Machine Learning, 2010), (software). Furthermore, for massive, very high-dimensional sparse datasets, we developed a computationally efficient implementation called sparse SELF (SSELF) and showed its usefulness in document classification, an important task in natural language processing (IEICE Transactions on Information and Systems, 2009), (software).

#### Unsupervised Dimensionality Reduction

Dimensionality reduction is also very important in unsupervised learning scenarios. The goal of unsupervised dimensionality reduction is to obtain a low-dimensional representation of the data while preserving the intrinsic information contained in the original data. However, this problem is not mathematically well-defined since the meaning of 'intrinsic information' is rather vague, which makes it a challenging problem.

##### Non-Gaussian Component Analysis

We focused on the situation where the original data can be decomposed into signal components included in a low-dimensional linear subspace and noise components, since this enables us to rigorously formulate the unsupervised dimensionality reduction problem. If the signal components follow some non-Gaussian distribution while the noise components follow a Gaussian distribution, the projection pursuit (PP) algorithm can identify the signal components. PP needs to fix in advance a single criterion, called the projection index, that measures the non-Gaussianity of the data. However, it is known that one standard projection index is suitable for finding super-Gaussian distributions, while another typical projection index is effective in finding sub-Gaussian distributions. Therefore, if the data contains both super- and sub-Gaussian signal components, it is not obvious how to optimally choose the projection index.

To cope with this problem, we proposed a general semi-parametric framework of unsupervised dimensionality reduction called non-Gaussian component analysis (NGCA). Within the NGCA framework, we proposed an algorithm called multi-index projection pursuit (MIPP) (Journal of Machine Learning Research, 2006). MIPP allows us to employ a number of projection indices at the same time. Therefore, one can reduce the dimensionality of the data even when it contains both super- and sub-Gaussian components.
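The idea of projection pursuit with a non-Gaussianity index can be sketched with a hedged toy example (one sub-Gaussian signal direction plus Gaussian noise; absolute excess kurtosis serves as the projection index, and the crude random search is purely illustrative):

```python
import numpy as np

# Data: one uniform (sub-Gaussian) signal direction and one Gaussian
# noise direction, both with unit variance.
rng = np.random.default_rng(10)
n = 2000
signal = rng.uniform(-np.sqrt(3), np.sqrt(3), n)
noise = rng.normal(size=n)
X = np.stack([signal, noise], axis=1)

def kurtosis_index(w):
    """Projection index: |excess kurtosis|, zero for Gaussian directions."""
    z = X @ (w / np.linalg.norm(w))
    z = (z - z.mean()) / z.std()
    return abs(np.mean(z ** 4) - 3.0)

# Naive search over random candidate directions.
dirs = rng.normal(size=(500, 2))
best = max(dirs, key=kurtosis_index)
best /= np.linalg.norm(best)
print(abs(best[0]))  # close to 1: the signal direction is recovered
```

With both super- and sub-Gaussian signals present, no single such index suffices, which is exactly the gap MIPP addresses by combining many indices.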

Although MIPP can overcome the traditional weakness of PP, it is still not the optimal method within the NGCA framework. We further developed an algorithm called the iterative metric adaptation for radial kernel functions (IMAK) (Annals of the Institute of Statistical Mathematics, 2007). IMAK can estimate the signal subspace more accurately than MIPP and PP.

We also applied the NGCA method to the noise reduction problem and developed an algorithm for accurately estimating the best linear unbiased estimator (IEICE Transactions on Information and Systems, 2008).

##### Direct Density ratio estimation with Dimensionality Reduction

Dimensionality reduction is also an important topic in the context of importance estimation / density ratio estimation. We proposed a framework called Direct Density ratio estimation with Dimensionality Reduction (D3; D-cube) for finding the subspace in which the numerator and denominator densities of the density ratio are different. We developed a computationally efficient D3 algorithm which combines Local Fisher Discriminant Analysis (LFDA) (Journal of Machine Learning Research, 2007) and unconstrained Least-Squares Importance Fitting (uLSIF) (Journal of Machine Learning Research, 2009). Furthermore, we proposed an algorithm called the Least-squares Hetero-distributional Subspace Search (LHSS), which can identify the subspace in which the discrepancy between the numerator and denominator densities is maximized (Neural Networks, 2011).

### Learning from Similarity Data / Kernel Methods

Input samples are directly used for learning in standard supervised learning, but it is also possible to use similarities among input sample points. The matrix that contains all pairwise similarities between input sample points is called the similarity matrix. When the similarity matrix is defined through positive semi-definite kernels, it is called the kernel matrix, and learning algorithms using kernel matrices are commonly referred to as kernel methods.

#### Kernelization

Any linear algorithm based on inner products can be non-linearized, without sacrificing the computational simplicity of linear algorithms, by replacing the inner products with kernel functions. Applying this technique, we can non-linearize our proposed algorithms: LFDA, SELF, SSELF (dimensionality reduction), and KLIEP, LSIF, uLSIF (density-ratio estimation).

#### Speeding-up Kernel Methods

The computational complexity of kernel methods generally grows cubically with the number of training samples, and therefore it is difficult to apply kernel methods to large-scale datasets. For kernel partial least-squares, we proposed an approximation method which allows us to compute the degrees of freedom and error bars efficiently (AISTATS2009).

The nu support vector machine is a powerful kernel-based classification algorithm. However, under its strict formulation, the nu support vector machine involves a non-convex optimization problem, and there was no known method for obtaining its globally optimal solution. Based on cutting plane methods, we developed a new optimization algorithm which is guaranteed to find the globally optimal solution (New Generation Computing, 2009).

We also published a review paper on recent advances in large-scale kernel methods (IEICE Transactions on Information and Systems, 2009).

Sparse learning methods are increasingly popular these days due to their high generalization performance with a large number of explanatory variables. We developed an augmented Lagrangian algorithm for efficient training of sparse regularization models (IEEE Signal Processing Letters, 2009).

#### Robustness

Kernel least-squares fitting is not robust against outliers. We applied the kernelized l_1 fitting method to robot control and showed its usefulness (ICRA2009).

#### Design of Similarity Matrices

In order to obtain better generalization performance, feature selection from input samples is important. However, it is often difficult to explicitly extract good features from input samples. In such cases, good generalization performance can still be obtained if feature selection is carried out implicitly, by defining similarities between input samples rather than explicitly selecting features.

Gaussian kernels are general-purpose similarity functions and give good results on the whole. However, if some prior knowledge of the problem domain is available, the generalization performance and computational efficiency can be improved by utilizing that prior knowledge.

We proposed the principal component kernel (PCK) which can be computed based on the prior distribution of the learning target function (IEICE Transactions on Information and Systems, 2006). We constructed a specific PC kernel for the binary regression problem and showed its effectiveness.

Kernel functions are also often employed in value function approximation in reinforcement learning. We proposed the geodesic Gaussian kernel (GGK), which can be constructed based on the manifold structure of the state space. We demonstrated the effectiveness of GGKs in robot arm control and Khepera robot navigation tasks (Autonomous Robots, 2008), (demo).
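The construction of a geodesic Gaussian kernel can be sketched on a toy graph (a hypothetical chain-shaped state space; shortest-path distances replace Euclidean ones inside the Gaussian kernel, which is the core idea, though not the paper's exact construction):

```python
import numpy as np

# A tiny state-space graph: a chain 0-1-2-3 with unit edge lengths.
INF = 1e9
n = 4
D = np.full((n, n), INF)
np.fill_diagonal(D, 0.0)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    D[i, j] = D[j, i] = 1.0

# Floyd-Warshall: all-pairs shortest-path (geodesic) distances.
for k in range(n):
    D = np.minimum(D, D[:, [k]] + D[[k], :])

# Geodesic Gaussian kernel: Gaussian of the graph distance, so similarity
# decays along the state-space manifold rather than through "walls".
width = 1.0
K = np.exp(-D ** 2 / (2 * width ** 2))
print(K[0, 1], K[0, 3])  # adjacent states are far more similar than distant ones
```

On a maze-like state space, two states separated by a wall get a small kernel value even if they are close in Euclidean coordinates, which is what makes such kernels useful for value function approximation.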

#### Learning from Multiple Similarity Matrices

Standard kernel methods deal with a single similarity matrix, but we are sometimes given multiple similarity matrices. In such cases, the generalization performance could be improved if the information brought by the multiple similarity matrices is combined. We proposed a new method that can learn from multiple similarity matrices. The proposed method is robust in the sense that noisy similarity matrices are automatically de-emphasized (IEEE Transactions on Neural Networks, 2009).

#### Learning from Non-positive Similarity Matrices

In some supervised learning situations, input samples themselves are not provided; only the similarities between input points can be observed. We can use kernel methods if the similarity matrix is positive semi-definite, but similarity matrices stemming from real problems do not necessarily possess positive semi-definiteness. To cope with this problem, we derived a necessary condition on similarity matrices for achieving better generalization performance and proposed a boosting-type algorithm based on this theory (Neural Computation, 2009).

### Learning under Different Distributions

In standard supervised learning, it is commonly assumed that the samples used for training follow the same probability distribution as the test samples. However, this assumption is not always satisfied in practice. For example, when the data generation mechanism is non-stationary, the training and test distributions may be different. In such situations, just using the training samples as they are does not provide good generalization performance due to the distribution difference.

We co-organized the NIPS2006 Workshop on this issue and edited a book (Dataset Shift in Machine Learning). Furthermore, we published a monograph on covariate shift adaptation (book: coming soon).

#### Covariate Shift Adaptation

The situation where the input distribution differs between the training and test phases but the functional relation remains unchanged is called covariate shift. Note that 'covariate' is the name for the input variable in statistics. A typical example of a covariate shift scenario is an extrapolation problem, where output values in a region with only a few training samples are learned. When active learning is carried out, covariate shift is naturally induced. The problem of covariate shift adaptation is to learn the target function (which is unchanged) given input-output training samples and input-only test samples (The Brain & Neural Networks, 2006, in Japanese), (Image Lab, 2007, in Japanese), (article in a book).

The unbiasedness of model selection criteria, including the subspace information criterion (SIC), is no longer maintained under covariate shift. To cope with this problem, we proposed an extended SIC called importance-weighted SIC (IWSIC) that is still unbiased even under covariate shift (Statistics & Decisions, 2005).

IWSIC is specialized for the squared loss function, so it is not applicable to classification tasks. Cross validation is a standard model selection technique in classification scenarios, but its unbiasedness is lost under covariate shift. We therefore developed an extended cross validation method called importance-weighted cross validation (IWCV) that works desirably in covariate-shifted classification tasks (Journal of Machine Learning Research, 2007).
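The mechanism of importance-weighted cross validation can be sketched as follows (a hedged toy example; the densities, model family, and fold scheme are hypothetical and chosen so that the importance weights have a closed form):

```python
import numpy as np

# Training inputs follow N(0, 1); test inputs follow N(1, 1), so the
# importance w(x) = p_test(x) / p_train(x) = exp(x - 1/2).  Weighting the
# validation errors by w keeps the CV score pointed at the *test* error.
rng = np.random.default_rng(6)
n = 200
x = rng.normal(0.0, 1.0, n)
y = np.sin(x) + 0.1 * rng.normal(size=n)
w = np.exp(x - 0.5)                       # importance weights in closed form

def iwcv_score(deg, n_folds=5):
    fold = np.arange(n) % n_folds
    errs = []
    for k in range(n_folds):
        tr, va = fold != k, fold == k
        coef = np.polyfit(x[tr], y[tr], deg)
        sq = (np.polyval(coef, x[va]) - y[va]) ** 2
        errs.append(np.average(sq, weights=w[va]))  # importance-weighted error
    return float(np.mean(errs))

best_deg = min(range(1, 8), key=iwcv_score)
print(best_deg)
```

Ordinary CV would weight all validation points equally and thus estimate the training-distribution error instead; the single change is the `weights=` argument.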

Furthermore, we theoretically elucidated the effect of covariate shift on Bayesian inference with singular models (ICML2007).

The effectiveness of covariate shift adaptation methods has been demonstrated in various real-world problems such as brain-computer interfaces, robot control, speech processing, natural language processing, and computer vision.

For the above series of works on covariate shift adaptation, the IBM Faculty Award was given in 2007.

#### Domain Adaptation / Transfer Learning

Domain adaptation / transfer learning deals with the situation where the functional relation also changes between the training and test phases, in addition to the input distributions. Without any restriction on the distribution change, the test function may be completely unrelated to the training samples, and therefore we may not be able to learn anything about the test function from the training samples. Thus, in order to make a meaningful discussion, we need a reasonable restriction on the distribution change.

We consider a restriction where the training input-output distribution is a mixture of two distributions and the test input-output distribution is one of the two. Under this restriction, we developed a learning method that can learn the test function given input-output training samples and input-only test samples (NIPS2006).

We developed a transfer learning method which enforces the target solution to be similar to related source solutions, and showed its effectiveness in biological network inference (International Journal of Knowledge Discovery in Bioinformatics, 2010).

#### Multi-task Learning

When we have several learning tasks, solving them together can be better than solving them individually, since common information can be shared among related tasks. This can be interpreted as performing domain adaptation over several tasks.

Under the setting that (a small number of) input-output samples are given for every task, we developed a method to simultaneously train multiple support vector machines by sharing common information (IEEE Transactions on Knowledge and Data Engineering, 2010).

### Importance Estimation / Density Ratio Estimation

The importance is the ratio of two probability density functions. The problem of importance estimation is to estimate the importance from two sets of samples drawn from the two probability distributions (book: coming soon).

The values of the importance could be estimated simply by estimating the two probability density functions and then taking the ratio of the estimated densities. However, it is hard to estimate density functions accurately, and therefore such a naive approach is not suitable. We proposed a method called the Kullback-Leibler importance estimation procedure (KLIEP), which directly models the importance function and does not involve density estimation (Annals of the Institute of Statistical Mathematics, 2008), (software). We then developed extended KLIEP methods: log-linear KLIEP (LL-KLIEP) (Journal of Information Processing, 2009), Gaussian-mixture KLIEP (GM-KLIEP) (IEICE Transactions on Information and Systems, 2009), and Probabilistic-PCA-mixture KLIEP (PM-KLIEP) (IEICE Transactions on Information and Systems, 2010). We also developed importance estimation methods based on the squared loss: least-squares importance fitting (LSIF) and unconstrained LSIF (uLSIF) (Journal of Machine Learning Research, 2009), (software). In LSIF, the regularization path can be computed efficiently, so the computation time including model selection can be significantly reduced. In uLSIF, the importance estimator as well as the leave-one-out score can be calculated analytically, so it is computationally very efficient. We have also proved that uLSIF has an excellent stability property in terms of a condition number analysis (coming soon).
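The spirit of direct, uLSIF-style ratio estimation can be sketched as follows (a hedged toy implementation with hypothetical data, kernel width, and regularization; it follows the regularized least-squares fitting idea with an analytic solution, but is not the published algorithm verbatim):

```python
import numpy as np

# Numerator samples from N(1, 1), denominator samples from N(0, 1); the
# true ratio is r(x) = exp(x - 1/2).  The ratio is modeled as a kernel
# expansion and fitted by regularized least squares -- no density estimation.
rng = np.random.default_rng(7)
x_nu = rng.normal(1.0, 1.0, 300)   # samples from the numerator density
x_de = rng.normal(0.0, 1.0, 300)   # samples from the denominator density
C = x_nu[:50]                      # kernel centers

def K(x, c, width=1.0):
    return np.exp(-(x[:, None] - c[None, :]) ** 2 / (2 * width ** 2))

H = K(x_de, C).T @ K(x_de, C) / len(x_de)   # second moments (denominator side)
h = K(x_nu, C).mean(axis=0)                 # first moments (numerator side)
alpha = np.linalg.solve(H + 0.1 * np.eye(50), h)   # analytic solution

w_hat = lambda x: K(x, C) @ alpha
# True ratio values: r(1) = e^0.5 ~ 1.65,  r(-1) = e^-1.5 ~ 0.22.
print(w_hat(np.array([1.0]))[0], w_hat(np.array([-1.0]))[0])
```

The closed-form solve is what makes this family of estimators computationally attractive, since leave-one-out scores can also be obtained analytically.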

For accurately estimating the density ratio in high-dimensional spaces, dimensionality reduction is critical. We named the subspace in which the numerator and denominator densities of the density ratio are different the hetero-distributional subspace, and proposed a new framework for estimating the density ratio with a hetero-distributional subspace search, called Direct Density ratio estimation with Dimensionality Reduction (D3; D-cube). We developed a computationally efficient D3 algorithm which combines Local Fisher Discriminant Analysis (LFDA) (Journal of Machine Learning Research, 2007) and unconstrained Least-Squares Importance Fitting (uLSIF) (Journal of Machine Learning Research, 2009). Furthermore, we proposed an algorithm called the Least-squares Hetero-distributional Subspace Search (LHSS), which can identify the subspace in which the discrepancy between the numerator and denominator densities is maximized (SDM2010).

Density ratio estimation has various applications such as non-stationarity adaptation, outlier detection, mutual information estimation, and conditional probability estimation. See the review papers for details (IPSJ Transactions, 2009), (RIMS Kokyuroku, 2010), (Proceedings of the Institute of Statistical Mathematics, 2010, in Japanese).

### Mutual Information Estimation / Divergence Estimation

Mutual information is an important quantity in information theory and is equivalent to the Kullback-Leibler divergence from a joint distribution to the product of its marginals. Thus, mutual information can be used for measuring statistical independence among random variables. Based on density ratio estimators, we proposed a mutual information estimator called maximum likelihood mutual information (MLMI) (FSMI2008), (software) and its squared-loss variant called least-squares mutual information (LSMI) (BMC Bioinformatics, 2009), (software).

Mutual information estimators can be used for various purposes such as dimensionality reduction, independent component analysis, and causal inference.
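
The identity underlying these estimators is that mutual information is the expected log of the density ratio between the joint distribution and the product of the marginals. A minimal sketch, computed exactly for a small discrete distribution (this illustrates the quantity itself, not the MLMI/LSMI estimators):

```python
import numpy as np

# Joint distribution of two binary variables (rows: X, columns: Y)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y

# Mutual information = KL divergence from p(x,y) to p(x)p(y),
# i.e., the expectation under p(x,y) of log of the density ratio
ratio = p_xy / (p_x * p_y)
mi = float((p_xy * np.log(ratio)).sum())
# mi is zero if and only if X and Y are independent; here it is positive
```

MLMI and LSMI follow the same logic with continuous variables: the ratio `p_xy / (p_x * p_y)` is estimated directly from paired samples by a density ratio estimator, without estimating any of the three densities.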

### Outlier Detection

Outlier detection is an unsupervised learning problem of finding 'irregular' samples in a data set. In a statistical formulation, a sample is regarded as an outlier if its probability density value is smaller than a threshold. Thus, if the probability density function of the data samples were identified, the statistical outlier detection problem would be completely solved. However, density estimation is known to be a hard task, particularly in high-dimensional problems, and therefore we want to avoid density estimation in outlier detection.

We consider an outlier detection problem where we want to identify outliers in a test data set based on a training data set that does not contain outliers. For this problem, we proposed an outlier detection method that uses the importance as a novelty score. Since the importance values can be obtained without going through density estimation, the proposed method works very well even in high-dimensional problems (Knowledge and Information Systems, 2011), (software1), (software2).
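
A minimal sketch of importance-based outlier scoring: a Gaussian-kernel least-squares density-ratio fit (a simplified illustration in the spirit of our direct estimators, with hypothetical helper names) assigns near-zero importance to test points that are implausible under the training distribution:

```python
import numpy as np

def gkernel(x, c, s=1.0):
    # Gaussian kernel matrix between 1-D samples x and centers c
    return np.exp(-((x[:, None] - c[None, :]) ** 2) / (2 * s * s))

def fit_ratio(x_nu, x_de, lam=0.1):
    """Least-squares fit of w(x) ~ p_nu(x) / p_de(x) with a Gaussian-kernel model."""
    c = x_nu[:100]                       # kernel centers from numerator samples
    Phi_de = gkernel(x_de, c)
    H = Phi_de.T @ Phi_de / len(x_de)
    h = gkernel(x_nu, c).mean(axis=0)
    a = np.linalg.solve(H + lam * np.eye(len(c)), h)
    return lambda x: np.maximum(gkernel(x, c) @ a, 0.0)

rng = np.random.default_rng(1)
x_train = rng.normal(size=500)                                   # inliers only
x_test = np.concatenate([rng.normal(size=95), np.full(5, 6.0)])  # 5 planted outliers
w = fit_ratio(x_train, x_test)     # numerator: training density
scores = w(x_test)                 # small importance -> likely outlier
flagged = np.argsort(scores)[:5]   # the five lowest-scoring test points
```

The planted outliers receive near-zero scores because the training density is negligible around them, while ordinary test points score close to one; thresholding the score then separates the two groups.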

### Independent Component Analysis

Independent component analysis (ICA) is the problem of separating mixed signals into statistically independent ones, and has been extensively studied in, e.g., speech signal processing. In ICA research, how to evaluate statistical independence among signals is a key issue. We proposed an ICA algorithm, called least-squares independent component analysis (LICA), which is based on a mutual information approximator using a density ratio estimator (Neural Computation, 2011).

### Causal Inference

Given a pair of random variables (X, Y), it is very important to test whether X causes Y, Y causes X, or there is no causal relation between X and Y. We proposed a causal inference algorithm, called least-squares independence regression (LSIR), which is based on a mutual information approximator using a density ratio estimator (AAAI2010).

### Conditional Probability Estimation

Estimating the conditional density when the conditioning variable is continuous is not a straightforward task. We proposed a method of estimating conditional densities called the least-squares conditional density estimator (LSCDE) (IEICE Transactions on Information and Systems, 2010), which is especially useful for multi-dimensional continuous variables. LSCDE is based on a density ratio estimator.

A similar idea can be applied to probabilistic pattern recognition, where the class-posterior probability is estimated in the classification scenario. Our density-ratio-based method, called the least-squares probabilistic classifier (LSPC), is faster than alternative approaches in training time by two orders of magnitude, with comparable pattern recognition accuracy (IEICE Transactions on Information and Systems, 2010), (software).

### Risk-sensitive Learning

Standard supervised learning methods such as least squares learn a function such that the mean error over the training samples is minimized. This guarantees the mean generalization performance of learning machines, but it does not necessarily mean that the generalization performance for the single dataset at hand is high. To appropriately handle such risk, it is necessary to design learning algorithms so that not only the mean error but also the error distribution is taken into account.

We proved that the boosting algorithm controls the error distribution properly and gave a new theoretical explanation for its success (COLT2008). Furthermore, we proved that the nu-support vector machine optimizes an error measure called the conditional value-at-risk (CVaR) and elucidated its generalization performance theoretically (New Generation Computing, 2009).
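
Empirically, the CVaR at level alpha is simply the mean of the worst (1 - alpha) fraction of losses, in contrast to the plain mean error. A minimal sketch (illustrating the quantity itself, not the nu-SVM algorithm):

```python
import numpy as np

def cvar(losses, alpha=0.9):
    """Empirical conditional value-at-risk:
    the mean of the worst (1 - alpha) fraction of losses."""
    losses = np.sort(np.asarray(losses, dtype=float))
    k = max(1, int(np.ceil((1 - alpha) * len(losses))))
    return losses[-k:].mean()   # average over the k largest losses

losses = np.array([0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 1.0, 5.0])
print(np.mean(losses))    # mean error: 0.86
print(cvar(losses, 0.9))  # CVaR at level 0.9: worst 10% of losses -> 5.0
print(cvar(losses, 0.8))  # CVaR at level 0.8: worst 20% -> (1.0 + 5.0) / 2 = 3.0
```

A learner that minimizes the mean error may still incur rare catastrophic losses; minimizing CVaR explicitly controls the upper tail of the error distribution.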

Risk control is also important in the framework of reinforcement learning for sequential decision making. We developed reinforcement learning algorithms for optimizing various risk measures (IEICE Transactions on Information and Systems, 2010). Furthermore, we proposed a new framework for estimating the sum of rewards obtained in the future and developed various algorithms for risk management (UAI2010), (ICML2010).

### Low-rank Approximation

When a matrix is noisy or has missing values, approximating it by a low-rank matrix can give a de-noised or missing-value-imputed matrix.

Since approximating the target matrix under a low-rank constraint results in a non-convex optimization problem, obtaining a good solution is difficult in practice. To avoid this problem, the trace-norm regularizer is useful. The trace-norm regularizer is a convex function that works as the l1-regularizer on singular values. Thus, it tends to produce sparse singular values, and hence a low-rank matrix, via convex optimization. However, optimization with trace-norm regularizers is still challenging for large-scale problems due to its non-differentiability. To cope with this problem, we proposed an efficient optimization algorithm for trace-norm regularization based on the dual augmented Lagrangian (ICML2010).
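
The sparsifying effect of the trace norm can be sketched via its proximal operator, singular value soft-thresholding: each singular value is shrunk toward zero, exactly as the l1-regularizer shrinks individual coefficients. This is a generic illustration of the regularizer's behavior, not our dual augmented Lagrangian algorithm:

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: the proximal operator of tau * trace-norm.
    Shrinking the spectrum sets small singular values exactly to zero,
    yielding a low-rank matrix."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return U @ np.diag(s_shrunk) @ Vt

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))  # rank-3 ground truth
noisy = low_rank + 0.1 * rng.normal(size=(50, 40))              # additive noise
denoised = svt(noisy, tau=2.0)
# The threshold wipes out the noise-level singular values,
# so the result has rank far below min(50, 40)
```

Iterating this operator on the observed entries is the basis of many trace-norm solvers; the difficulty addressed by the dual augmented Lagrangian approach is making such iterations scale to large matrices.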

Another popular formulation of low-rank approximation is matrix factorization: a matrix is decomposed into the product of two 'thin' matrices. The matrix factorization model is non-identifiable, meaning that the decomposition is redundant. We showed that, in the variational Bayesian framework, the non-identifiability of the matrix factorization model induces a regularization effect even when the prior distributions are flat, which is notably different behavior from identifiable models (ICML2010). The solution of variational Bayesian matrix factorization is usually computed by an iterative algorithm. Although the optimization problem is non-convex, we showed that the global optimal solution can be computed analytically (NIPS2010).

## Applications of Machine Learning

### Precipitation Estimation

We applied the subspace information criterion (SIC) to precipitation estimation and won the first prize in prediction accuracy at the 2001 IEICE Precipitation Estimation Contest (Proceedings of the 2001 IEICE General Conference, 2001).

### Image Restoration

The problem of restoring images degraded by noise or blur is mathematically equivalent to supervised learning. We formulated the problem of optimizing image restoration filters as a model selection problem and proposed a filter design method based on SIC (IEICE Transactions on Information and Systems, 2001). The paper received high praise for its originality and usefulness, and we received the Niwa Memorial Award from Tokyo Denki University in 2002. We further developed an integrated filter design method (Signal Processing, 2002).

In image restoration, it is sometimes not sufficient to just obtain subjectively clear images; for example, when a face recorded by a security camera is restored, we should not add any subjective artifact to the image. In such scenarios, it is effective to model the degradation process and restore the image following the modeled process. Based on this idea, we restored the image of a historically important camera and contributed to elucidating the history of Japanese cameras (Technical Report, 2004).

### Brain-Computer Interface

A Brain-Computer Interface (BCI) is a user interface that allows users to control computers with brain signals. BCI allows paralyzed patients to use computers comfortably, and even ordinary people can enjoy a more comfortable computing environment with the help of brain signals. Because of their high time resolution, EEG signals are often used for BCI: EEG signals that contain a user's intention (e.g., left or right) are classified into different control commands by a supervised learning technique. However, the non-stationarity of the brain causes a significant drift in the EEG signals between the training and test phases, so standard supervised learning methods do not work appropriately.

We modeled the non-stationarity of the brain as covariate shift and applied importance-weighted cross-validation (IWCV), which substantially improved the BCI recognition accuracy (Journal of Machine Learning Research, 2007), (IEEE Transactions on Biomedical Engineering, 2010).
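
The core idea of covariate shift adaptation can be sketched with importance-weighted least squares: each training loss is weighted by the ratio of the test input density to the training input density. Here the true densities are used for clarity; in practice the ratio must be estimated, e.g., by the direct density ratio estimators described earlier. This is a generic illustration, not the BCI pipeline itself:

```python
import numpy as np

def gauss_pdf(x, mu, s):
    # Density of N(mu, s^2) evaluated at x
    return np.exp(-(x - mu) ** 2 / (2 * s * s)) / (s * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
# Covariate shift: training inputs follow N(0, 1), test inputs follow N(1, 0.5^2)
x_tr = rng.normal(0.0, 1.0, 200)
y_tr = np.sin(2 * x_tr) + 0.1 * rng.normal(size=200)
x_te = rng.normal(1.0, 0.5, 1000)
y_te = np.sin(2 * x_te)

# Importance weights w(x) = p_test(x) / p_train(x)
w = gauss_pdf(x_tr, 1.0, 0.5) / gauss_pdf(x_tr, 0.0, 1.0)

# Ordinary vs. importance-weighted least squares for a linear model y = a*x + b
X = np.column_stack([x_tr, np.ones_like(x_tr)])
X_te = np.column_stack([x_te, np.ones_like(x_te)])
theta_ols = np.linalg.solve(X.T @ X, X.T @ y_tr)
theta_iw = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y_tr))
err_ols = np.mean((X_te @ theta_ols - y_te) ** 2)
err_iw = np.mean((X_te @ theta_iw - y_te) ** 2)
# The weighted fit concentrates on the test region and attains a lower test error
```

IWCV applies the same weighting to the validation loss, so that model selection is also unbiased under the shifted test distribution.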

### Surface Profiler

Optical interference microscopes allow surface profiles to be measured at the nanometer scale. Surface profiling by optical interference microscopes can be regarded as an example of supervised learning.

Most existing surface profiling algorithms need to measure the interference level multiple times with the phase of the light changed. Surface profilers are often used in environments with vibration, such as factories, where the measurement accuracy degrades significantly if measurement is carried out multiple times. To cope with this problem, we developed an efficient and accurate method called local model fitting (LMF) that allows us to recover the surface profile from only a single measurement (Applied Optics, 2006). We further provided a theoretical error analysis of the LMF method and proposed an improved method, interpolated LMF (iLMF) (Applied Optics, 2009). We also developed iteratively-reweighted LMF (IRLMF), which can adaptively determine the local regions in LMF (Applied Optics, 2010).

Surface profiling algorithms based on monochromatic light sources, including the above LMF method, generally have a strong limitation on the maximum height of steps that can be measured due to phase periodicity. The maximum height can be extended by using multiple monochromatic light sources with different wavelengths; however, this requires multiple measurements, and thus the advantage of one-shot methods such as LMF is lost. We developed a method that extends the maximum measurable step height with only a single measurement (Journal of the Japan Society for Precision Engineering, 2009, in Japanese). For this work, we won the Odawara Award 2nd Prize at the Vision Engineering Workshop (ViEW2007) in 2007.

When the target object is covered by a film layer, existing surface profiling algorithms do not give accurate results because of refraction; furthermore, the thickness of the film cannot be measured. To cope with this problem, we developed algorithms that can simultaneously measure the surface profile and the thickness of the film: white light (Transactions of the Society of Instrument and Control Engineers, 2007, in Japanese), multiple-shot monochromatic light (Transactions of the Society of Instrument and Control Engineers, 2009, in Japanese), and one-shot monochromatic light (Journal of the Japan Society for Precision Engineering, 2009, in Japanese).

### Robot Control

Appropriately designing basis functions for value function approximation is an important problem in reinforcement learning. We applied the geodesic Gaussian kernel (GGK) to Khepera robot navigation tasks and showed that the proposed method provides faster learning than conventional methods (Autonomous Robots, 2008), (demo).

We further showed that updating robot control policies through online learning can be formulated as a covariate shift problem. We applied the importance-weighted cross-validation (IWCV) method to value function approximation and proposed sample-reuse policy iteration (SRPI) (Neural Networks, 2009). We further applied a similar idea to a direct policy learning method and proposed reward-weighted regression with sample reuse (R^3) (ECML-PKDD2009).

Collecting data samples is often costly in real-world robotics, and therefore designing a good sampling policy is highly important. We applied our active learning method, ALICE, to sampling policy design in robot control and proposed active policy iteration (API) (Neural Networks, 2010).

In practical robot-control problems, reward samples are often measured by sensors and thus contain outliers. We applied a robust regression technique to policy iteration and proposed least absolute policy iteration (LAPI) (IEICE Transactions on Information and Systems, 2010), (demo).

### Semi-conductor Wafer Alignment

Recent semi-conductors have a layered circuit structure, which is built by exposing circuit patterns multiple times. In this process, it is extremely important to align the wafer at the same position with very high accuracy. To this end, the locations of markers are measured to adjust the shift and rotation of wafers. However, measuring the locations of markers is time-consuming, and therefore there is a strong need to reduce the number of markers to be measured in order to speed up the semi-conductor production process. We applied our active learning method, PALICE, to the wafer alignment problem and showed its usefulness (Machine Learning, 2009).

### Speech Processing

The goal of speaker identification is to predict who the speaker is from speech signals. In speaker identification, non-stationarity caused by changing environments and emotions is often a source of performance degradation. We formulated speaker identification under non-stationarity as classification under covariate shift, and proposed a variant of logistic regression for non-stationarity adaptation (Signal Processing, 2010).

### Natural Language Processing

Word segmentation of Japanese sentences is a fundamental task in natural language processing. While labeled samples for general conversation corpora are readily available, data in specialized domains such as medical science are not. We applied covariate shift adaptation techniques and showed that the word segmentation accuracy for medical documents can be improved by adapting the general conversation data to the medical domain (Journal of Information Processing, 2009).

Document classification involves very high-dimensional sparse data, so dimensionality reduction is essential. We applied the semi-supervised dimensionality reduction method, SELF, to document classification, and showed that the classification accuracy is substantially improved (IEICE Transactions on Information and Systems, 2009), (Machine Learning, 2010), (software).

We developed an efficient thesaurus construction method based on unsupervised dimensionality reduction (IEICE Transactions on Information and Systems, 2010).

### Computer Vision

Predicting age from face images has recently attracted a great deal of attention since it is useful for designing effective marketing strategies. However, manually labeling face images is laborious, so we want to utilize unlabeled image data (i.e., semi-supervised learning). We developed an active learning strategy based on the cluster structure of unlabeled samples and a regression method based on the characteristics of human perception (IEICE Transactions on Information and Systems, 2010).

Face images possess high diversity due to angles and lighting conditions, so training and test images often follow different distributions. We developed a new age prediction method based on covariate shift adaptation techniques (ICPR2010).