- Types of Learning
- Theories and Algorithms of Machine Learning
- Model Selection
- Active Learning
- Incremental Learning / Online Learning
- Dimensionality Reduction
- Learning from Similarity Data / Kernel Methods
- Learning under Different Distributions
- Importance Estimation / Density Ratio Estimation
- Mutual Information Estimation / Divergence Estimation
- Outlier Detection
- Independent Component Analysis
- Causal Inference
- Conditional Probability Estimation
- Risk-sensitive Learning
- Low-rank Approximation
- Applications of Machine Learning
In semi-supervised learning, the number of input-output samples is typically small, while the number of input-only samples is very large. In such situations, the generalization performance may be further improved by utilizing a large number of input-only samples in addition to a small number of input-output samples.
A typical approach to reinforcement learning does not directly learn the policy function itself; instead, it learns the value function, a mapping from a state (and an action) to the expected sum of rewards. The desired policy function can then be generated based on the value function. (book in Japanese: Training Robotic Game Players by Reinforcement Learning), (book: coming soon).
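The relationship between a value function and the policy generated from it can be sketched in a few lines. The toy chain MDP below is purely illustrative (its states, actions, and rewards are assumptions, not taken from any of the cited works): value iteration learns Q(s, a), and the policy is then read off greedily from Q.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9  # actions: 0 = left, 1 = right

def step(s, a):
    """Deterministic toy transition; reward 1 for entering the goal state."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if (s == n_states - 2 and a == 1) else 0.0
    return s_next, reward

# Value iteration: learn Q(s, a), the discounted sum of future rewards.
Q = np.zeros((n_states, n_actions))
for _ in range(100):
    for s in range(n_states):
        for a in range(n_actions):
            s_next, reward = step(s, a)
            Q[s, a] = reward + gamma * Q[s_next].max()

# The desired policy is generated from the value function by acting greedily.
policy = Q.argmax(axis=1)
print(policy)  # move right everywhere before the goal, then turn back
```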
In supervised learning, it is important to determine the complexity of learning machines (models) appropriately. If the model is too simple, it is not flexible enough to represent the learning target function, and therefore the generalization performance cannot be improved even if a large amount of training data is employed for learning. On the other hand, if the model is too complex, it is capable of representing the target function, but such a complex model is heavily affected by the noise in the training data. Therefore, in practical situations with a rather small number of training samples, overly complex models do not give better generalization performance.
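The trade-off described above can be illustrated with a hedged toy experiment (the sine target, noise level, and degrees below are illustrative assumptions): polynomials that are too simple and too complex are fitted to noisy samples, and their errors on independent test points are compared.

```python
import numpy as np

rng = np.random.default_rng(0)
x_tr = rng.uniform(0, 3, 20)
y_tr = np.sin(x_tr) + 0.2 * rng.standard_normal(20)   # noisy training data
x_te = rng.uniform(0, 3, 1000)                        # independent test inputs

test_error = {}
for degree in (1, 4, 15):
    coef = np.polyfit(x_tr, y_tr, degree)             # least-squares fit
    test_error[degree] = np.mean((np.polyval(coef, x_te) - np.sin(x_te)) ** 2)

# Degree 1 underfits (too simple to represent sin); degree 15 chases the
# noise; an intermediate complexity generalizes best.
print(test_error)
```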
Model selection is one of the most fundamental and central research topics in supervised learning. Model selection is generally carried out by finding the model such that the generalization performance is maximized. Therefore, the key to model selection research is deriving an accurate estimator of the generalization performance.
We proposed a model selection criterion called the subspace information criterion (SIC), an unbiased estimator of the generalization error (Neural Computation, 2001). SIC has been shown to work well because of its data-dependence; thus SIC tends to outperform existing methods in single trials (Machine Learning, 2002). SIC was originally proposed as a criterion for selecting subspace models. Later, we extended SIC so that it can choose the regularization parameter in linear ridge regression (Neural Networks, 2002), the regularization parameter in sparse linear regression (IEEE Transactions on Neural Networks, 2002), and the regularization parameter in kernel ridge regression (Journal of Machine Learning Research, 2002). Furthermore, we theoretically evaluated the performance of SIC when the training input points also contain noise (Neural Information Processing - Letters and Reviews, 2004), and extended the range of application of SIC so that non-linear parameter learning methods are also allowed (IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2007). For the above research works, the Young Researcher Award was given by the Japanese Neural Network Society (JNNS) in 2001, the Information Technology Award by the Funai Foundation for Information Technology in 2003, and the Nakamura Research Award by the Tejima Foundation for Industry and Education in 2003 (IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2003). Furthermore, we developed the regularized SIC (RSIC), which can significantly reduce the variance of SIC by slightly sacrificing its unbiasedness (Neural Computation, 2004). RSIC remarkably improves the stability and reliability of model selection.
We also proposed methods for analytically obtaining the best model under RSIC, which improve the model selection accuracy and the computational efficiency at the same time: for shrinkage learning (IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2006) and for adaptive ridge learning (IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2007). RSIC includes a regularization parameter in the model selection criterion, which is optimized based on a meta model selection criterion. We improved the meta criterion and showed that the model selection performance is further improved (IEICE Transactions on Information and Systems, 2007).
In supervised learning, if the location of training input points is designed appropriately, one can obtain better generalization capability with a small amount of training data. The problem of designing the training input location is called active learning (or experimental design). The challenge of active learning is to predict the generalization performance of a learning machine before training data are gathered. Therefore, active learning is thought to be a more difficult problem than model selection; in particular, how to cope with the bias of the estimator, which generally cannot be predicted without access to training data, is the key issue in active learning.
We proposed a two-stage active learning method (Neural Computation, 2000). This method improves the performance by explicitly suppressing the increase of the variance in the first stage while guaranteeing that the bias vanishes. The thesis that contains this work was selected for the Doctoral Dissertation Award by the Tejima Foundation for Industry and Education in 2002.
The above two-stage method is practically useful. However, it is an incremental method, i.e., the input locations are determined one by one in a sequential manner, so it can only achieve greedy optimality. Theoretically, batch active learning, i.e., designing all input locations at the same time, is globally optimal. Based on this fact, we developed a globally optimal batch active learning method for the case where the learning target function is included in trigonometric polynomial models, and elucidated its theoretical properties and geometric structure (IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2001). For this achievement, we received the Young Engineer Award from the Institute of Electronics, Information and Communication Engineers (IEICE) in 2002.
In the above methods, the model at hand is assumed to be correct. However, this assumption may not be satisfied in practice. Therefore, developing an active learning method applicable to misspecified models (i.e., the learning target function is not included in the model) is practically very important, but it is known to be a hard problem since it is difficult to control the bias of misspecified models. For misspecified models, importance-weighted least-squares learning is proved to give an asymptotically unbiased estimator. Based on this fact, we proposed a batch active learning method called ALICE (Journal of Machine Learning Research, 2006). ALICE is guaranteed to give a minimum-variance design of input locations under the constraint that the bias is asymptotically minimized. ALICE has been applied to efficient design of sampling policies in robot control.
ALICE is an effective active learning method when we are allowed to set the input points at any desired location. In some practical situations, however, we are only given a finite number of input points and have to choose the best input points among the candidates. Such a scenario is called pool-based active learning. We extended ALICE to pool-based ALICE (P-ALICE) and showed its usefulness through extensive simulations (Machine Learning, 2009). P-ALICE has been applied to wafer alignment in semiconductor exposure apparatus.
Active learning and model selection are interrelated: the optimal training input location depends on the chosen model, while choosing the model requires training data; this is the model selection/active learning dilemma. For this challenging problem, we proved that there exists a common optimal training input location for trigonometric polynomial models. Based on this result, we developed an algorithm for active learning with model selection for trigonometric polynomial models that can successfully avoid the model selection/active learning dilemma (IEICE Transactions on Information and Systems, 2003).
Furthermore, we extended the above method to accommodate a general class of models and developed the method of ensemble active learning (Neural Networks, 2009).
In supervised learning, when training samples are provided one by one in a sequential manner, it is computationally efficient to build the learning result upon the previously learned results. Learning in such an incremental manner is called incremental learning (or online learning).
We developed an incremental learning method called incremental projection learning (IPL), which relies only on incremental computation but provides exactly the same learning result as that obtained by batch learning with all training samples (Neural Networks, 2001a). We also elucidated theoretical properties of IPL (Neural Networks, 2001b). IPL forms a basis for developing incremental active learning algorithms. Furthermore, we proposed an IPL-based method of constructing neural networks (IEICE Transactions on Information and Systems, 2002). For the above series of works on incremental learning, the Inose Science Promotion Award was given by the Foundation of Electrical, Electronics, and Information Science Promotion in 1998.
Online learning is also important in the context of density ratio estimation. We developed an online version of the density ratio estimation algorithm KLIEP, and showed its effectiveness through change-point detection experiments (SDM2009).
Discriminant Analysis
A typical method of supervised dimensionality reduction in classification scenarios is Fisher discriminant analysis (FDA). When the samples in each category follow a normal distribution with a common covariance structure, FDA provides the optimal solution. However, when the samples in some category form separate clusters (i.e., are multi-modal), FDA does not give desired solutions. To cope with this problem, we proposed local Fisher discriminant analysis (LFDA), which works well with multi-modal data (Journal of Machine Learning Research, 2007), (software). For this work, the Incentive Award was given by the Japanese Society for Artificial Intelligence (JSAI) in 2007.
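For reference, the core computation of classical FDA (the method that LFDA refines) can be sketched as a generalized eigenvalue problem between the between-class and within-class scatter matrices. The two-class synthetic data below are an illustrative assumption.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (100, 2)),    # class 0
               rng.normal([4.0, 1.0], 1.0, (100, 2))])   # class 1
y = np.repeat([0, 1], 100)

mean_all = X.mean(axis=0)
S_w = np.zeros((2, 2))   # within-class scatter
S_b = np.zeros((2, 2))   # between-class scatter
for c in (0, 1):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_w += (Xc - mc).T @ (Xc - mc)
    d = (mc - mean_all)[:, None]
    S_b += len(Xc) * (d @ d.T)

# FDA direction: leading generalized eigenvector of (S_b, S_w).
_, evecs = eigh(S_b, S_w)
w = evecs[:, -1]
print(w / np.linalg.norm(w))   # roughly the direction separating the two means
```

Projecting the data onto `w` separates the two classes; LFDA replaces these global scatter matrices with locality-aware ones so that multi-modal classes are preserved.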
Sufficient Dimension Reduction
Sufficient dimension reduction is an approach to supervised dimensionality reduction for both regression and classification, aimed at finding a low-dimensional input subspace that retains all the information about the outputs. Based on a method of density-ratio estimation, we developed a sufficient dimension reduction method (AISTATS2010). In semi-supervised learning settings where input-only samples are given in addition to input-output samples, we expect to obtain better dimensionality reduction performance by utilizing the input-only samples.
For semi-supervised classification scenarios, we developed semi-supervised local Fisher discriminant analysis (SELF) by combining principal component analysis (unsupervised dimensionality reduction) and local Fisher discriminant analysis (supervised dimensionality reduction) (Machine Learning, 2010), (software). Furthermore, for very high-dimensional sparse massive datasets, we developed a computationally efficient implementation called sparse SELF (SSELF) and showed its usefulness in document classification, an important task in natural language processing (IEICE Transactions on Information and Systems, 2009), (software).
Dimensionality reduction is also performed in unsupervised learning scenarios. The goal of unsupervised dimensionality reduction is to obtain a low-dimensional representation of the data while preserving the intrinsic information contained in the original data. However, this problem is not mathematically well defined since the meaning of 'intrinsic information' is rather vague. Therefore, it is a cumbersome problem.
Non-Gaussian Component Analysis
We focused on the situation where the original data can be decomposed into signal components included in a low-dimensional linear subspace and noise components, since this enables us to rigorously formulate the unsupervised dimensionality reduction problem. If the signal components follow some non-Gaussian distribution while the noise components follow a Gaussian distribution, the projection pursuit (PP) algorithm can identify the signal components. PP needs to fix in advance a single criterion, called the projection index, that measures the non-Gaussianity of the data. However, it is known that one standard projection index is suitable for finding super-Gaussian distributions while another typical projection index is effective in finding sub-Gaussian distributions. Therefore, if the data contain both super- and sub-Gaussian signal components, it is not obvious how to optimally choose the projection index.
To cope with this problem, we proposed a general semi-parametric framework of unsupervised dimensionality reduction called non-Gaussian component analysis (NGCA). Within the NGCA framework, we proposed an algorithm called multi-index projection pursuit (MIPP) (Journal of Machine Learning Research, 2006). MIPP allows us to employ a number of projection indices at the same time. Therefore, one can reduce the dimensionality of the data even when it contains both super- and sub-Gaussian components.
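The reason a single projection index is insufficient can be illustrated with a toy computation (this is not the MIPP algorithm itself; the data composition below is an assumption): excess kurtosis, a classical projection index, is positive for super-Gaussian components, negative for sub-Gaussian ones, and near zero for Gaussian noise, so an index tuned to one sign misses components of the other sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
# Columns 0-1 are non-Gaussian signals; columns 2-3 are Gaussian noise.
X = np.column_stack([
    rng.laplace(size=n) / np.sqrt(2.0),            # unit-variance Laplace (super-Gaussian)
    rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n),   # unit-variance uniform (sub-Gaussian)
    rng.standard_normal((n, 2)),                   # Gaussian noise components
])

def excess_kurtosis(z):
    """A classical projection index: > 0 super-Gaussian, < 0 sub-Gaussian."""
    z = (z - z.mean()) / z.std()
    return float(np.mean(z ** 4) - 3.0)

kurt = [excess_kurtosis(X[:, j]) for j in range(4)]
print(kurt)   # clearly positive, clearly negative, then two values near zero
```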
Although MIPP overcomes the traditional weakness of PP, it is still not the optimal method within the NGCA framework. We further developed an algorithm called the iterative metric adaptation for radial kernel functions (IMAK) (Annals of the Institute of Statistical Mathematics, 2007). IMAK can estimate the signal subspace more accurately than MIPP and PP.
We also applied the NGCA method to the noise reduction problem and developed an algorithm for accurately estimating the best linear unbiased estimator (IEICE Transactions on Information and Systems, 2008).
Direct Density-Ratio Estimation with Dimensionality Reduction
Dimensionality reduction is also an important topic in the context of importance estimation / density ratio estimation. We proposed a framework called Direct Density-ratio estimation with Dimensionality Reduction (D3; D-cube) for finding the subspace in which the numerator and denominator densities of the density ratio are different. We developed a computationally efficient D3 algorithm that combines Local Fisher Discriminant Analysis (LFDA) (Journal of Machine Learning Research, 2007) and unconstrained Least-Squares Importance Fitting (uLSIF) (Journal of Machine Learning Research, 2009). Furthermore, we proposed an algorithm called the Least-squares Hetero-distributional Subspace Search (LHSS), which can identify the subspace in which the discrepancy between the numerator and denominator densities is maximized (Neural Networks, 2011).
In some supervised learning scenarios, input samples themselves are not provided, but it is possible to use similarities among input sample points. The matrix that contains all pairwise similarities between input sample points is called the similarity matrix. When the similarity matrix is defined through positive semi-definite kernels, it is called the kernel matrix, and learning algorithms using kernel matrices are commonly referred to as kernel methods. Many of the methods described on this page, e.g., those for dimensionality reduction and for density-ratio estimation such as KLIEP, LSIF, and uLSIF, are kernel methods (AISTATS2009).
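As a minimal sketch of these definitions (with an arbitrary toy dataset as an assumption), the Gaussian kernel yields a similarity matrix that is positive semi-definite, which can be checked numerically through its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))   # 50 toy input points in 3 dimensions
sigma = 1.0

# Pairwise squared Euclidean distances, then the Gaussian kernel matrix.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2.0 * sigma ** 2))

# A kernel matrix is symmetric and (numerically) positive semi-definite.
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())   # non-negative up to floating-point error
```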
The nu support vector machine is a powerful kernel-based classification algorithm. However, in its strict formulation, the nu support vector machine involves a non-convex optimization problem, and there was no known method for obtaining its global optimal solution. Based on cutting plane methods, we developed a new optimization algorithm which is guaranteed to find the global optimal solution (New Generation Computing, 2009).
We also published a review paper on recent advances in large-scale kernel methods (IEICE Transactions on Information and Systems, 2009).
Sparse learning methods are increasingly popular these days due to their high generalization performance with a large number of explanatory variables. We developed an augmented Lagrangian algorithm for efficient training of sparse regularization models (IEEE Signal Processing Letters, 2009), applied it to robot control, and showed its usefulness (ICRA2009).
Gaussian kernels are general-purpose similarity functions and give good results on the whole. However, if some prior knowledge of the problem domain is available, the generalization performance and computational efficiency could be improved by utilizing the prior knowledge.
We proposed the principal component kernel (PCK), which can be computed based on the prior distribution of the learning target function (IEICE Transactions on Information and Systems, 2006). We constructed a specific PCK for the binary regression problem and showed its effectiveness.
Kernel functions are also often employed in value function approximation in reinforcement learning. We proposed the geodesic Gaussian kernel (GGK), which can be constructed based on the manifold structure of the state space. We demonstrated the effectiveness of GGKs in robot arm control and Khepera robot navigation tasks (Autonomous Robots, 2008), (demo), (IEEE Transactions on Neural Networks, 2009).
In some supervised learning situations, input samples themselves are not provided, and only the similarity between input points can be observed. We can use kernel methods if the similarity matrix is positive semi-definite, but similarity matrices stemming from real problems do not necessarily possess positive semi-definiteness. To cope with this problem, we derived a necessary condition on similarity matrices for achieving better generalization performance and proposed a boosting-type algorithm based on this theory (Neural Computation, 2009).
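The geodesic idea behind GGK described above can be sketched on a toy state graph (the graph and kernel width below are illustrative assumptions): the Euclidean distance inside a Gaussian kernel is replaced by the shortest-path distance on the state-transition graph.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

# Toy state space: a chain of 5 states with unit-cost transitions.
adj = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    adj[i, j] = adj[j, i] = 1.0

# Geodesic (shortest-path) distances replace Euclidean distances.
D = shortest_path(adj)
sigma = 1.0
K = np.exp(-D ** 2 / (2.0 * sigma ** 2))   # geodesic Gaussian kernel matrix
print(D[0, 4], K[0, 4])   # states far apart on the graph barely interact
```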
In supervised learning, it is commonly assumed that the samples used for training follow the same probability distribution as the test samples. However, this assumption is not always satisfied in practice. For example, when the data generation mechanism is non-stationary, the training and test distributions may be different. In such situations, using the training samples as they are does not provide good generalization performance due to the distribution difference.
We co-organized the NIPS2006 Workshop on this issue and edited a book (Dataset Shift in Machine Learning). Furthermore, we published a monograph on covariate shift adaptation (book: coming soon). When active learning is carried out, covariate shift, i.e., the situation where the input distribution differs between the training and test phases, is naturally induced. The problem of covariate shift adaptation is to learn the target function (which is unchanged) given input-output training samples and input-only test samples (The Brain & Neural Networks, 2006, in Japanese), (Image Lab, 2007, in Japanese), (article in a book).
The unbiasedness of model selection criteria, including the subspace information criterion (SIC), is no longer maintained under covariate shift. To cope with this problem, we proposed an extension of SIC called importance-weighted SIC (IWSIC) that remains unbiased even under covariate shift (Statistics & Decisions, 2005).
IWSIC is specialized to the squared loss function, so it is not applicable to classification tasks. Cross-validation is a standard model selection technique in classification scenarios, but its unbiasedness is lost under covariate shift. We therefore developed an extended cross-validation method called importance-weighted cross-validation (IWCV) that works desirably in covariate-shifted classification tasks (Journal of Machine Learning Research, 2007).
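The core of IWCV can be sketched as follows. This is a hedged illustration: the training and test densities are assumed known so that the importance weights are exact, and the squared loss is used for simplicity rather than a classification loss.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_tr = stats.norm(0.0, 1.0)    # training input density (assumed known)
p_te = stats.norm(0.5, 0.7)    # test input density (assumed known)

x = p_tr.rvs(200, random_state=rng)
y = np.sin(x) + 0.1 * rng.standard_normal(200)
w = p_te.pdf(x) / p_tr.pdf(x)  # importance weights for held-out errors

def iwcv_error(degree, n_folds=5):
    """Importance-weighted k-fold CV score for a polynomial model."""
    fold = np.arange(x.size) % n_folds
    errs = []
    for k in range(n_folds):
        tr, va = fold != k, fold == k
        coef = np.polyfit(x[tr], y[tr], degree)
        # Weighting each held-out error by the density ratio restores the
        # unbiasedness of the generalization-error estimate under covariate shift.
        errs.append(np.mean(w[va] * (np.polyval(coef, x[va]) - y[va]) ** 2))
    return float(np.mean(errs))

scores = {d: iwcv_error(d) for d in (1, 3, 10)}
print(scores)   # the overly simple linear model scores poorly
```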
Furthermore, we theoretically elucidated the effect of covariate shift on Bayesian inference with singular models (ICML2007).
The effectiveness of covariate shift adaptation methods has been demonstrated in various real-world problems such as brain-computer interfaces, robot control, speech processing, natural language processing, and computer vision.
For the above series of works on covariate shift adaptation, the IBM Faculty Award was given by IBM in 2007.
We consider a restriction where the training input-output distribution is a mixture of two distributions and the test input-output distribution is one of the two distributions. Under this restriction, we developed a learning method that can learn the test function given input-output training samples and input-only test samples (NIPS2006).
We developed a transfer learning method which enforces the target solution to be similar to related source solutions, and showed its effectiveness in biological network inference (International Journal of Knowledge Discovery in Bioinformatics, 2010).
Under the setting that (a small number of) input-output samples are given for every task, we developed a method to simultaneously train multiple support vector machines by sharing common information (IEEE Transactions on Knowledge and Data Engineering, 2010).
(book: coming soon).
The values of the importance could be estimated simply by estimating the two probability density functions and then taking the ratio of the estimated densities. However, it is hard to estimate density functions accurately, and therefore such a naive approach is not suitable. We proposed a method called the Kullback-Leibler importance estimation procedure (KLIEP), which directly models the importance function and does not involve density estimation (Annals of the Institute of Statistical Mathematics, 2008), (software). We then developed extended KLIEP methods: log-linear KLIEP (LL-KLIEP) (Journal of Information Processing, 2009), Gaussian-mixture KLIEP (GM-KLIEP) (IEICE Transactions on Information and Systems, 2009), and probabilistic-PCA-mixture KLIEP (PM-KLIEP) (IEICE Transactions on Information and Systems, 2010). We also developed importance estimation methods based on the squared loss: least-squares importance fitting (LSIF) and unconstrained LSIF (uLSIF) (Journal of Machine Learning Research, 2009), (software). In LSIF, the regularization path can be computed efficiently, so the computation time including model selection can be significantly reduced. In uLSIF, the importance estimator as well as the leave-one-out score can be calculated analytically, so it is computationally very efficient. We have also proved that uLSIF has an excellent stability property in terms of condition number analysis (coming soon).
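The analytic solution that makes uLSIF computationally attractive can be sketched in a few lines: model the density ratio as a sum of Gaussian basis functions and fit the coefficients by regularized least squares, which reduces to one linear system. The basis choice and regularization parameter below are illustrative assumptions, and model selection by cross-validation is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
x_nu = rng.normal(0.0, 1.0, 500)   # samples from the numerator density
x_de = rng.normal(0.5, 1.2, 500)   # samples from the denominator density

# Gaussian basis functions centered on (a subset of) the numerator samples.
centers = x_nu[:50]
width = 1.0
def phi(x):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * width ** 2))

Phi_de, Phi_nu = phi(x_de), phi(x_nu)
H = Phi_de.T @ Phi_de / x_de.size   # basis second moment under the denominator
h = Phi_nu.mean(axis=0)             # basis first moment under the numerator
lam = 0.1                           # regularization parameter (illustrative)

# The uLSIF solution is analytic: one regularized linear system.
alpha = np.linalg.solve(H + lam * np.eye(centers.size), h)

def ratio(x):
    """Estimated density ratio, clipped at zero."""
    return np.maximum(phi(x) @ alpha, 0.0)

print(ratio(np.array([0.0, 3.0])))   # larger where the numerator density dominates
```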
For accurately estimating the density ratio in high-dimensional spaces, dimensionality reduction is critical. We named the subspace in which the numerator and denominator densities of the density ratio are different the hetero-distributional subspace, and proposed a new framework for estimating the density ratio with hetero-distributional subspace search, called Direct Density-ratio estimation with Dimensionality Reduction (D3; D-cube). We developed a computationally efficient D3 algorithm which combines Local Fisher Discriminant Analysis (LFDA) (Journal of Machine Learning Research, 2007) and unconstrained Least-Squares Importance Fitting (uLSIF) (Journal of Machine Learning Research, 2009). Furthermore, we proposed an algorithm called the Least-squares Hetero-distributional Subspace Search (LHSS), which can identify the subspace in which the discrepancy between the numerator and denominator densities is maximized (SDM2010).
Density ratio estimation has various applications such as non-stationarity adaptation, outlier detection, mutual information estimation, and conditional probability estimation. See the review papers for details (IPSJ Transactions, 2009), (RIMS Kokyuroku, 2010), (Proceedings of the Institute of Statistical Mathematics, 2010, in Japanese).
Based on density ratio estimators, we proposed mutual information estimators called maximum likelihood mutual information (MLMI) (FSMI2008), (software) and its squared-loss variant called least-squares mutual information (LSMI) (BMC Bioinformatics, 2009), (software).
Mutual information estimators can be used for various purposes such as dimensionality reduction, independent component analysis, and causal inference.
Outlier detection is an unsupervised learning problem of finding 'irregular' samples in a data set. Within a statistical formulation, a sample is regarded as an outlier if its probability density value is smaller than a threshold. Thus, if the probability density function of the data samples is identified, the statistical outlier detection problem is completely solved. However, density estimation is known to be a hard task, particularly in high-dimensional problems, and therefore we want to avoid density estimation in outlier detection.
We consider an outlier detection problem where we want to identify outliers in a test data set based on a training data set that does not contain outliers. For this problem, we proposed an outlier detection method that uses the importance as a novelty score. Since the importance values can be obtained without going through density estimation, the proposed method works well even in high-dimensional problems (Knowledge and Information Systems, 2011), (software1), (software2).
We developed an independent component analysis method based on a mutual information approximator that uses a density ratio estimator (Neural Computation, 2011).
We also developed a causal inference method based on a mutual information approximator that uses a density ratio estimator (AAAI2010).
We developed a conditional density estimation method called least-squares conditional density estimation (LSCDE) (IEICE Transactions on Information and Systems, 2010), which is especially useful for multi-dimensional continuous variables. LSCDE is based on a density ratio estimator. A similar idea can be applied to probabilistic pattern recognition, where the class-posterior probability is estimated in the classification scenario. Our density-ratio-based method, called the least-squares probabilistic classifier (LSPC), is faster to train than alternative approaches by two orders of magnitude, with comparable pattern recognition accuracy (IEICE Transactions on Information and Systems, 2010), (software).
We proved that the boosting algorithm controls the error distribution properly and gave a new theoretical explanation for its success (COLT2008). Furthermore, we proved that the nu-support vector machine optimizes an error measure called the conditional value-at-risk (CVaR) and elucidated its generalization performance theoretically (New Generation Computing, 2009).
Risk control is also important in the framework of reinforcement learning for sequential decision making. We developed reinforcement learning algorithms for optimizing various risk measures (IEICE Transactions on Information and Systems, 2010). Furthermore, we proposed a new framework for estimating the sum of rewards obtained in the future and developed various algorithms for risk management (UAI2010), (ICML2010).
Since approximating the target matrix with a low-rank constraint results in a non-convex optimization problem, obtaining a good solution is difficult in practice. To avoid this problem, the trace-norm regularizer is useful. The trace-norm regularizer is a convex function and works as the l1-regularizer on singular values. Thus it tends to produce sparse singular values, i.e., a low-rank matrix, via convex optimization. However, optimization with trace-norm regularizers is still challenging for large-scale problems due to their non-differentiability. To cope with this problem, we proposed an efficient optimization algorithm for trace-norm regularization based on the dual augmented Lagrangian (ICML2010).
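The connection between the trace norm and the l1-regularizer mentioned above can be made concrete via the proximal operator of the trace norm, which soft-thresholds singular values (a standard result; the denoising example below is purely illustrative):

```python
import numpy as np

def prox_trace_norm(M, lam):
    """Proximal operator of lam * trace-norm: soft-threshold singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(0)
low_rank = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 20))
noisy = low_rank + 0.1 * rng.standard_normal((20, 20))

# Small (noise-level) singular values are zeroed while large ones survive,
# so the result is low-rank, analogously to l1 sparsity on coefficients.
denoised = prox_trace_norm(noisy, lam=2.0)
print(np.linalg.matrix_rank(noisy), np.linalg.matrix_rank(denoised))
```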
Another popular formulation of low-rank approximation is matrix factorization: a matrix is decomposed into the product of two `thin' matrices. The matrix factorization model is non-identifiable, meaning that the decomposition is redundant. We showed that, in the variational Bayesian framework, the non-identifiability of the matrix factorization model induces a regularization effect even when the prior distributions are flat, which is a notably different behavior from identifiable models (ICML2010). The solution of variational Bayesian matrix factorization is usually computed by an iterative algorithm. Although the optimization problem is non-convex, we showed that the global optimal solution can be computed analytically (NIPS2010).
We participated in the 2001 IEICE Precipitation Estimation Contest (Proceedings of the 2001 IEICE General Conference, 2001).
Image restoration can be regarded as a supervised learning problem. We formulated the problem of optimizing image restoration filters as a model selection problem and proposed a filter design method based on SIC (IEICE Transactions on Information and Systems, 2001). The paper received high praise for its originality and usefulness, and we received the Niwa Memorial Award from Tokyo Denki University in 2002. We further developed an integrated filter design method (Signal Processing, 2002).
In image restoration, it is sometimes not sufficient to just obtain subjectively clear images; for example, when the face recorded by a security camera is restored, we should not add any subjective artifact in the image. In such scenarios, it is effective to model the degradation process and restore the image following the modeled process. Based on this idea, we restored the image of a historically important camera and contributed to elucidating the Japanese camera history (Technical Report, 2004).
In brain-computer interfaces (BCI), brain signals such as electroencephalogram (EEG) signals are classified by a supervised learning technique. However, the non-stationarity of the brain causes a significant drift in the EEG signals between the training and test phases, so standard supervised learning methods do not work appropriately.
We modeled the non-stationarity of the brain as covariate shift and applied IWCV, which substantially improved the BCI recognition accuracy (Journal of Machine Learning Research, 2007), (IEEE Transactions on Biomedical Engineering, 2010).
Most existing surface profiling algorithms need to measure the interference level multiple times while changing the phase of the light. Surface profilers are often used in environments with vibration, such as factories, where the measurement accuracy degrades significantly if measurement is carried out multiple times. To cope with this problem, we developed an efficient and accurate method called local model fitting (LMF) that allows us to recover the surface profile from only a single measurement (Applied Optics, 2006). We further provided a theoretical error analysis of the LMF method and proposed an improved method, interpolated LMF (iLMF) (Applied Optics, 2009). We also developed iteratively-reweighted LMF (IRLMF), which can adaptively determine the local regions in LMF (Applied Optics, 2010).
Surface profiling algorithms based on monochromatic light sources, including the above LMF method, generally have a strong limitation on the maximum height of steps that can be measured due to the phase periodicity. The maximum height can be extended by using multiple monochromatic light sources with different wavelengths; however, this requires multiple measurements, and thus the advantage of one-shot methods such as LMF is spoiled. We developed a method that extends the maximum measurable height with a single measurement (Journal of the Japan Society for Precision Engineering, 2009, in Japanese). For this work, we won the Odawara Award 2nd Prize at the Vision Engineering Workshop (ViEW2007) in 2007.
When the target object is covered by a film layer, existing surface profiling algorithms do not give accurate results because of refraction; furthermore, the thickness of the film cannot be measured. To cope with this problem, we developed algorithms that can simultaneously measure the surface profile and the thickness of the film: with white light (Transactions of the Society of Instrument and Control Engineers, 2007, in Japanese), with multiple-shot monochromatic light (Transactions of the Society of Instrument and Control Engineers, 2009, in Japanese), and with one-shot monochromatic light (Journal of the Japan Society for Precision Engineering, 2009, in Japanese).
Robot control can be formulated as a reinforcement learning problem. We applied the geodesic Gaussian kernel (GGK) to Khepera robot navigation tasks and showed that the proposed method provides faster learning than conventional methods (Autonomous Robots, 2008), (demo).
We further showed that the update of robot control policies through online learning can be formulated as covariate shift. We applied the importance-weighted cross-validation (IWCV) method to value function approximation and proposed sample-reuse policy iteration (SRPI) (Neural Networks, 2009). We further applied a similar idea to a direct policy learning method and proposed reward-weighted regression with sample reuse (R^3) (ECML-PKDD2009).
Collecting data samples is often costly in real-world robotics, and therefore designing a good sampling policy is highly important. We applied our active learning method, ALICE, to sampling policy design in robot control and proposed active policy iteration (API) (Neural Networks, 2010).
In practical robot-control problems, reward samples are often measured by sensors and thus contain outliers. We applied a robust regression technique to policy iteration and proposed least absolute policy iteration (LAPI) (IEICE Transactions on Information and Systems, 2010), (demo).
We applied our pool-based active learning method, P-ALICE, to the wafer alignment problem and showed its usefulness (Machine Learning, 2009).
We formulated the non-stationarity of speech data as covariate shift and proposed a variant of logistic regression for non-stationarity adaptation (Signal Processing, 2010).
We applied covariate shift adaptation techniques and showed that the word segmentation accuracy for medical documents can be improved by adapting general conversation data to the medical domain (Journal of Information Processing, 2009).
Document classification involves very high-dimensional sparse data, so dimensionality reduction is essential. We applied the semi-supervised dimensionality reduction method SELF to document classification and showed that the classification accuracy is substantially improved (IEICE Transactions on Information and Systems, 2009), (Machine Learning, 2010), (software).
We developed an efficient thesaurus construction method based on unsupervised dimensionality reduction (IEICE Transactions on Information and Systems, 2010).
Predicting human age from face images can be formulated as a regression problem, where unlabeled face samples are also available (semi-supervised learning). We developed an active learning strategy based on the cluster structure of unlabeled samples and a regression method based on the characteristics of human perception (IEICE Transactions on Information and Systems, 2010).
Face images possess high diversity due to angles and lighting conditions, so training and test images often follow different distributions. We developed a new age prediction method based on covariate shift adaptation techniques (ICPR2010).