WiML 2016 Accepted Posters: Abstracts

Abstracts listed here are for archival purposes and do not constitute proceedings for this workshop.

Konstantina Christakopoulou, Filip Radlinski and Katja Hofmann. Towards Conversational Recommender Systems. Contributed Talk
Abstract: People often ask others for restaurant recommendations as a way to discover new dining experiences. This makes restaurant recommendation an exciting scenario for recommender systems and has led to substantial research in this area. However, most such systems behave very differently from a human when asked for a recommendation. The goal of this paper is to begin to reduce this gap.

In particular, humans can quickly establish preferences when asked to make a recommendation for someone they do not know. Consider what would happen if a conference attendee in your home town, whom you have never met before, asked for a recommendation on where to eat dinner today. Most likely, you would start with one or two clarifying questions, perhaps whether the person likes seafood, or whether they have a car. These questions would depend on the context; for instance if there are great restaurants around the corner, then whether they have a car would be irrelevant.

We argue that such an interaction can be represented using online learning, where two types of learning are occurring. First, the person making the recommendation is learning about the preferences of the person asking. However, the attributes learned will be contextual, based on the likely follow-on answers. Second, the person making the recommendation is learning about which questions allow them to quickly reach a good recommendation in the current context. In this paper, we present a recommendation system that exhibits these two types of learning. Further, the learning is online: It immediately impacts future recommendations for this user, rather than requiring a batch reprocessing of information learned.

We present a bandit-based approach for online recommendation, applied to restaurant recommendation so as to ground it in a specific application. We use generalized Thompson Sampling to systematically sample questions to ask the user and to incorporate observed feedback. We use a matrix factorization approach to learn and adapt the embedding. We further propose and compare a range of alternative question selection strategies to identify characteristics of approaches that most effectively learn users' preferences.
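To make the bandit machinery concrete, the following is a minimal sketch of question selection for a cold-start user, not the paper's generalized Thompson Sampling: the random item embeddings, the Gaussian user posterior, and the noise level are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items, n_questions = 5, 50, 10

# Item embeddings, standing in for those learned offline by matrix
# factorization (here simply random, an assumption for illustration).
V = rng.normal(size=(n_items, d))
w_true = rng.normal(size=d)   # hidden preferences of the cold-start user

# Gaussian posterior over the user vector (Bayesian linear regression).
A = np.eye(d)                 # posterior precision
b = np.zeros(d)

for _ in range(n_questions):
    # Thompson step: sample a plausible user vector, ask about the item
    # that this sample rates highest, then update the posterior online.
    w_sample = rng.multivariate_normal(np.linalg.solve(A, b), np.linalg.inv(A))
    item = int(np.argmax(V @ w_sample))
    feedback = V[item] @ w_true + 0.1 * rng.normal()  # noisy absolute rating
    A += np.outer(V[item], V[item])
    b += feedback * V[item]

w_hat = np.linalg.solve(A, b)            # posterior mean after the questions
recommendation = int(np.argmax(V @ w_hat))
```

Relative (pairwise) feedback would fit the same loop by replacing the rating observation with a comparison between two sampled items.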

Our main contributions are four-fold. (1) We propose a novel view of human-like recommenders that converse with new users to learn their preferences. (2) We successfully demonstrate a fully online learning approach for recommendation, using both absolute and relative feedback. (3) We propose a systematic approach to incorporating offline data to initialize online learning recommenders, and demonstrate the resulting performance gains. (4) We propose a set of item selection strategies for deciding what question to ask a cold-start user to most quickly infer preferences, and demonstrate the benefits of bandit-based strategies.

Our experiments with both synthetic and real world data compare different types of feedback and question selection strategies. We find that our framework can make very effective use of online user feedback, improving personalized recommendations over a static model by 25% after asking only 2 questions. Our results demonstrate dramatic benefits of starting from offline embeddings, and highlight the benefit of bandit-based explore-exploit strategies in this setting.

The paper was published in KDD’16.
Madalina Fiterau. Learning representations from time series data through contextualized LSTMs. Contributed Talk
Abstract: Time series are prevalent in biomedical data, from high-frequency vital signals, motion tracking waveforms and activity monitoring to longitudinal health indicators for patients with slow-progressing conditions such as osteoarthritis or cerebral palsy. The typical approaches to handle this data are either (1) too generic, not accounting for the characteristics (domain, mode of collection, context) of the dataset or (2) highly tailored to the dataset through painstaking feature engineering, requiring the intervention and supervision of domain experts in an iterative process. The former category includes tools such as PCA, FFT and embeddings via basis functions. An example of the latter is the processing performed on kinematic waveforms for gait analysis, which is impossible without a keen knowledge of biomechanics. While the former techniques are limited by their insufficient use of context, the latter are usually not generalizable, needing fine-tuning even for datasets of the same type.

We enable the use of context (structured covariates) in learning time series representations, thus obtaining a more generic framework for handling such data. This allows the incorporation of domain expertise in feature learning, flexibility in adapting to other datasets, and easy integration of time series data with other data types such as structured covariates, text and images. By using multi-resolution LSTMs, we construct salient representations without the need for feature engineering.
Instead of deriving features for a given time series, the algorithm learns how to derive features for any number of temporal datasets, based on examples. We introduce two ways of attaining applicability for new types of data. First, we make the lower layers of the deep architecture flexible at runtime. This allows capturing the specifics of the data early in the transformation, while the rest of the network operates on a higher-order abstraction. For instance, we might train a deep network on step counts from accelerometers, but also use it on raw triaxial data. Additionally, this allows us to easily account for subject-specific traits. A second way to widen applicability is to directly model the similarities between samples. We consider the context of the data, which can include information about the domain, mode of collection or structured covariates linked to the time series. Contexts from different samples are introduced into a CNN that encapsulates context similarity, trained by backpropagating the difference in labels. The time series from both training and target samples are transformed through a multi-resolution network using LSTMs and convolutional layers. The feature examples given for the training samples are also incorporated.

We apply the procedure to accelerometer data from the Osteoarthritis (OA) Initiative, obtained from 2000 subjects, for a monitoring period of 7 days. This data is provided in terms of activity counts. The goal is to identify which subjects are at risk for fast OA progression. Previous studies on this data have applied simple transformations to the time series such as histograms or spline basis representations. Our model, which includes stacked LSTM layers for the time series combined with clinical covariates, improves the classification performance from 60% to 73%.
Sara Magliacane, Tom Claassen and Joris Mooij. Ancestral Causal Inference. Contributed Talk
Abstract: I would like to present the work described in the NIPS 2016 poster “Ancestral Causal Inference”.

Discovering causal relations from data is at the foundation of the scientific method. Traditionally, cause-effect relations have been recovered from experimental data in which the variable of interest is perturbed, but seminal work like the do-calculus and the PC/FCI algorithms demonstrate that, under certain assumptions, it is already possible to obtain significant causal information by using only observational data.

Recently, there have been several proposals for combining observational and experimental data to discover causal relations. These causal discovery methods are usually divided into two categories: constraint-based and score-based methods. Score-based methods typically evaluate models using a penalized likelihood score, while constraint-based methods use statistical independences to express constraints over possible causal models. The advantages of constraint-based over score-based methods are the ability to handle latent confounders naturally, no need for parametric modeling assumptions and an easy integration of complex background knowledge, especially in the logic-based methods.

Two major disadvantages of constraint-based methods are: (i) vulnerability to errors in statistical independence test results, which are quite common in real-world applications, (ii) no ranking or estimation of the confidence in the causal predictions. Several approaches address the first issue and improve the reliability of constraint-based methods by exploiting redundancy in the independence information. Unfortunately, existing approaches have to choose to sacrifice either accuracy by using a greedy method, or scalability by formulating a discrete optimization problem on a super-exponentially large search space.

Additionally, the second issue is addressed only in limited cases.

We propose Ancestral Causal Inference (ACI), a logic-based method that provides a comparable accuracy to the best state-of-the-art constraint-based methods, but improves on their scalability by using a more coarse-grained representation of causal information, which, though still super-exponentially large, drastically reduces computation time. Instead of representing all possible direct causal relations, in ACI we represent and reason only with ancestral relations (“indirect” causal relations). This representation turns out to be very convenient, because in real-world applications the distinction between direct causal relations and ancestral relations is not always clear or necessary. Moreover, once we reconstruct ancestral relations, we can always refine the prediction to direct causal relations by constraining standard methods to a much smaller search space.

Furthermore, we propose a method to score predictions according to their confidence. The confidence score is an approximation of the marginal probability of an ancestral relation. Scoring predictions enables one to rank them according to their reliability. This is very important for practical applications, as the low reliability of the predictions of constraint-based methods has been a major impediment to their widespread usage.

We provide some theoretical guarantees for ACI, like soundness and asymptotic consistency, and demonstrate that it can outperform the state-of-the-art on synthetic data, achieving a speedup of several orders of magnitude. We illustrate its practical feasibility by applying it on a challenging protein data set that so far had only been addressed with score-based methods.
Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. Contributed Talk
Abstract: We study the effects of the depth and width of a neural network on its expressive power. Precise theoretical and experimental results are derived in the generic setting of neural networks after random initialization. We find that three different measures of functional expressivity show an exponential dependence on depth but not width: the number of transitions (a measure of non-linearity/complexity), network activation patterns (a new definition with an intrinsic link to hyperplane arrangements in input space), and the number of dichotomies. These three measures are related to each other and are also directly proportional to a fourth quantity, trajectory length. Most crucially, we show, both theoretically and experimentally, that trajectory length grows exponentially with depth, which is why all three measures display an exponential dependence on depth.
These results also suggest that parameters earlier in the network have greater influence over the expressive power of the network: a layer's influence on expressivity is determined by the remaining depth of the network after that layer. This is supported by experiments on fully connected and convolutional networks on MNIST and CIFAR-10.
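One of the measures above, the number of transitions, can be estimated numerically. The sketch below counts activation-pattern changes along a circular input trajectory through randomly initialized ReLU networks; the width, depths, and weight scale are arbitrary choices for illustration, not those of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def count_transitions(depth, width=20, n_points=2000, sigma_w=2.0):
    """Count activation-pattern changes along a circular input trajectory
    through a randomly initialized ReLU network."""
    Ws = [rng.normal(scale=sigma_w / np.sqrt(width), size=(width, width))
          for _ in range(depth)]
    t = np.linspace(0, 2 * np.pi, n_points)
    x = np.zeros((n_points, width))
    x[:, 0], x[:, 1] = np.cos(t), np.sin(t)   # trajectory in input space
    transitions, h = 0, x
    for W in Ws:
        pre = h @ W.T
        pattern = pre > 0                      # binary activation pattern
        transitions += int(np.sum(pattern[1:] != pattern[:-1]))
        h = np.maximum(pre, 0)
    return transitions

shallow = count_transitions(depth=2)
deep = count_transitions(depth=8)
```

With the weight variance above the critical scale, the deeper network should register far more pattern transitions along the same trajectory.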
Amy Zhang, Xianming Liu, Tobias Tiecke and Andreas Gros. Using Convolutional Neural Networks to Estimate Population Density from High Resolution Satellite Images. Contributed Talk
Abstract: This work is part of the Internet.org initiative to provide connectivity all over the world. Population density data is helpful in driving a variety of technology decisions, but currently no fine-grained dataset of population exists. Aggregate population counts on the spatial scale of provinces or districts are known from population censuses but alone are insufficient, as these areas vary in geographical size and do not provide insight about population distributions on a granular level; current state-of-the-art population density datasets are at ~1000km^2 resolution. To create a better dataset, we have obtained 500TB of satellite imagery at 50cm/pixel resolution to feed through our building classification pipeline. Our final data set has a spatial resolution of 5 meters and thereby improves over previous countrywide data sets by multiple orders of magnitude. We've analyzed 30 countries and counting.

Why this is a hard problem: We obtain our training data from labelers, collecting about 300k 30m×30m samples labeled as house/no house. Using golden data, we can measure the accuracy of our labelers, and we discovered that they achieve only 85% accuracy (our model is more accurate than our labelers). Another issue is the heavy skew of the data: even using conventional image processing algorithms to preselect 30m×30m candidate areas, only ~2% of these areas contain buildings.

Our contribution is a fast building classification pipeline that can run through an average-sized country in a matter of hours on appropriately sized infrastructure. We train custom convolutional neural networks that are tailored for satellite imagery recognition. The building classification net is globally trained and obtains precision and recall of >90% in most countries. We are also working on a weakly supervised footprint detection model that processes larger images with more context and produces a mask of the location and shape of each building. By using a binary labeling scheme (the image does/does not contain a building), the neural network learns “what” and “where” simultaneously. It succeeds in identifying outlines of buildings, highlighting those for which it has high confidence while suppressing areas not likely to contain human-made structures.
Qinyi Zhang, Sarah Filippi, Arthur Gretton and Dino Sejdinovic. Large-Scale Kernel Methods for Independence Testing
Abstract: Representations of probability measures in reproducing kernel Hilbert spaces provide a flexible framework for fully nonparametric hypothesis tests of independence, which can capture any type of departure from independence, including nonlinear associations and multivariate interactions. However, these approaches come with an at least quadratic computational cost in the number of observations, which can be prohibitive in many applications. Arguably, it is exactly in such large-scale datasets that capturing any type of dependence is of interest, so striking a favourable tradeoff between computational efficiency and test performance for kernel independence tests would have a direct impact on their applicability in practice. In this contribution, we provide an extensive study of the use of large-scale kernel approximations in the context of independence testing, contrasting block-based, Nyström and random Fourier feature approaches. Through a variety of synthetic data experiments, it is demonstrated that our novel large-scale methods give comparable performance to existing methods whilst using significantly less computation time and memory.
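As an illustration of the random Fourier feature route, an HSIC-style dependence statistic can be built from the cross-covariance of the two feature maps. This is a sketch under simplifying assumptions (fixed kernel bandwidth, no null-distribution calibration), not the paper's actual tests.

```python
import numpy as np

rng = np.random.default_rng(2)

def rff(z, n_features=50, gamma=1.0):
    """Random Fourier features approximating the Gaussian kernel
    exp(-gamma * ||z - z'||^2)."""
    z = np.asarray(z, dtype=float).reshape(len(z), -1)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(z.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(z @ W + b)

def rff_hsic(x, y):
    """Dependence statistic from the cross-covariance of the feature maps;
    costs O(n) in the sample size rather than the O(n^2) of the exact
    kernel statistic. Larger values indicate stronger dependence."""
    fx, fy = rff(x), rff(y)
    fx -= fx.mean(axis=0)
    fy -= fy.mean(axis=0)
    C = fx.T @ fy / len(fx)          # empirical cross-covariance
    return float(np.sum(C ** 2))     # squared Frobenius norm

n = 1000
x = rng.normal(size=n)
y_dep = x ** 2 + 0.1 * rng.normal(size=n)   # nonlinear dependence on x
y_ind = rng.normal(size=n)                  # independent of x

stat_dep, stat_ind = rff_hsic(x, y_dep), rff_hsic(x, y_ind)
```

A practical test would additionally calibrate the statistic against a permutation or analytic null distribution to obtain a p-value.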
Anjali Silva, Steven Rothstein and Sanjeena Dang. Model selection for clustered high-throughput genomic data via Poisson mixture models
Samira Ebrahimi Kahou, Vincent Michalski, Roland Memisevic and Christopher Pal. RATM: Recurrent Attentive Tracking Model
Abstract: We present an attention-based modular neural framework for computer vision. The framework uses a soft attention mechanism allowing models to be trained with gradient descent. It consists of three modules: a recurrent attention module controlling where to look in an image or video frame, a feature-extraction module providing a representation of what is seen, and an objective module formalizing why the model learns its attentive behavior. The attention module allows the model to focus computation on task-related information in the input. We apply the framework to several object tracking tasks and explore various design choices. We experiment with three data sets: bouncing ball, moving digits, and the real-world KTH data set. The proposed Recurrent Attentive Tracking Model performs well on all three tasks and can generalize to related but previously unseen sequences from a challenging tracking data set.
M. Julia Flores, R. Martin Krug and Javier Lara Valtueña. Supervised Classification to mate-in-one exercises aimed at chess training for beginners
Abstract: In this work we have tackled the problem of chess from a trainer perspective, aiming for the most basic, yet highly relevant, type of exercise: mate-in-one. Heretofore, the greatest efforts connecting computational intelligence and chess were to make programs play good chess. We show that other possibilities can be exploited. Taking advantage of its formal structure, together with the perspective of a chess trainer, we defined the problem as the automatic prediction of the difficulty of a given mate-in-one exercise. We have extracted up to 17 valuable features, e.g. the number of pieces per player, the number of possible mates, or the distance of the black king to the board centre. Difficulty is the target variable, using three class labels.

Our training data consist of a selection of 300 exercises from the popular material known as 'Polgar'. All games have been downloaded, and the necessary features extracted from them. In order to perform supervised classification, we needed a class labelling process, which was carried out by the chess trainer (expert). Our methodology was incremental in the sense that we started with an initial and immediate set of attributes, which we then enriched with more sophisticated ones that were shown to achieve better performance in terms of accuracy.

In this particular work we have used decision trees, Bayesian network classifiers and artificial neural networks as classifiers. We selected representative models with 'default' parametrisation, namely C4.5, Naive Bayes and Multilayer Perceptron, since the goal of this work is a preliminary study of the suitability of the approach and a comparison among distinct classification paradigms. We used 10-fold cross-validation for evaluation. Decision trees provided the best results, not only in terms of accuracy but also in efficiency (less CPU time). However, the Naive Bayes classifier is the one that benefits most from feature selection.
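A minimal sketch of this kind of comparison, using scikit-learn stand-ins for the three paradigms (DecisionTreeClassifier in place of C4.5) and synthetic data in place of the labelled Polgar exercises, which are not public:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 300 labelled mate-in-one exercises:
# 17 features, 3 difficulty classes.
X, y = make_classification(n_samples=300, n_features=17, n_informative=8,
                           n_classes=3, random_state=0)

classifiers = {
    "decision tree (C4.5 analogue)": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "multilayer perceptron": MLPClassifier(max_iter=2000, random_state=0),
}

# 10-fold cross-validated accuracy, as in the evaluation described above.
scores = {name: cross_val_score(clf, X, y, cv=10).mean()
          for name, clf in classifiers.items()}
```

The relative ranking on real exercise features may of course differ from what this synthetic data produces.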

The main contributions of the current work are: (1) the problem set-up of transforming the chess learning problem into a classification task and (2) a preliminary study of how distinct classifiers perform. We plan to work further on this problem, using more sophisticated classifiers and obtaining more pre-labelled instances. We also plan to create an on-line platform with users solving exercises, and to utilise further variables such as the ELO/level of trainee players or the time used to solve an exercise.

As an important remark, this automatic classification of exercises can be of great utility for chess trainers, together with on-line platforms providing a personalised, adaptive training plan. For example, if a certain player tries to solve an exercise of low difficulty and fails, the system shouldn't jump to higher difficulty. On the contrary, if the user solves the proposed exercises successfully, and quickly enough, then higher-difficulty exercises could be provided. Another application could be the automatic generation of exercises, for example, taking the final move in real games and labelling them automatically.
Rūta Užupytė and Tomas Krilavičius. A clustering approach to electricity consumption profiles using smart meters data
Abstract: Conventional electricity meters are being rapidly replaced by smart meters, offering improved accuracy and detailed information on energy consumption. High-granularity electricity usage data (e.g., hourly) can be refined into more valuable information, e.g., profiles of electricity users. Accurate information about the actual behavior of electricity users would allow electricity providers to design tariffs for different classes of electricity customers as well as to develop better marketing and trading strategies. On the other hand, customers can lower their energy bills by consuming more energy in lower-price periods. The development of a methodology for customer segmentation based on consumption habits is the objective of this research.
The proposed profiling approach is based on periodicity analysis and the well-known k-means clustering technique. In order to identify seasonality and other important periods we apply spectral analysis: analyzing periodograms of electricity consumption, we can identify the most pronounced periods and use them as a basis for analysis. In the following step we identify typical patterns representing each customer's electricity demand during the selected period, e.g. typical twenty-four-hour patterns. Based on the idea that users may have different energy consumption patterns during weekends and working days, we split the data into two smaller data sets which are analyzed separately. Moreover, data normalization is performed to ensure that clustering is based on the shape of the pattern and not on total usage. K-means clustering is used to find the main consumption patterns, and the most appropriate number of clusters can be identified by performing experiments with different k values and evaluating internal clustering quality criteria (e.g., Dunn index, Davies-Bouldin index, mean index adequacy, cluster dispersion indicator). The extracted consumption patterns are once again clustered using the k-means algorithm, and clustering adequacy indices are applied in order to select an appropriate number of electricity profiles. The last stage is result consolidation: working-day and weekend/holiday results are combined by analyzing the frequencies of all possible combinations.
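The shape-based clustering step can be sketched as follows, on synthetic daily load curves; the data, the normalization by total daily usage, and the candidate k values are illustrative assumptions, not the study's setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(3)

# Synthetic hourly loads for 100 customers over one day: two behaviour
# types (morning peak vs evening peak) at very different usage levels.
hours = np.arange(24)
morning = np.exp(-0.5 * ((hours - 8) / 2.0) ** 2)
evening = np.exp(-0.5 * ((hours - 19) / 2.0) ** 2)
shapes = np.stack([morning, evening])
labels_true = rng.integers(0, 2, size=100)
scale = rng.uniform(1, 50, size=(100, 1))       # total usage varies widely
loads = scale * shapes[labels_true] + 0.05 * rng.random((100, 24))

# Normalise each profile so clustering follows the *shape* of the pattern,
# not the total usage.
profiles = loads / loads.sum(axis=1, keepdims=True)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)

# Internal validity index over candidate k (Davies-Bouldin: lower is better).
db = {k: davies_bouldin_score(
         profiles,
         KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(profiles))
      for k in (2, 3, 4)}
```

Without the normalisation step, k-means would mostly separate heavy from light consumers rather than morning from evening patterns.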
The proposed clustering approach was applied to a real-world data set consisting of 3 years (January 2011 to December 2013) of hourly measurements for 3753 industrial users. Results show that the proposed algorithm identifies well-separated clusters clearly representing the behavior of electricity users. Moreover, the method leaves the supplier the possibility of defining the number of profiles by adjusting the threshold value for the minimum profile size. Furthermore, the algorithm makes it possible to choose how to design profiles for load patterns with uncommon behavior.
Anastasia Podosinnikova, Francis Bach and Simon Lacoste-Julien. Beyond CCA: Moment Matching for Multi-View Models
Abstract: In this work, we focus on multi-view data, i.e. data represented by pairs of aligned observations such as images or videos and text. The goal is to extract common latent information from such data. A usual tool for this is canonical correlation analysis (CCA) or its probabilistic analog, probabilistic canonical correlation analysis. The latter, however, is unidentifiable, which means that the canonical directions and latent factors are not uniquely defined by the model. This causes interpretability problems, since interpreting something that is not uniquely defined does not make much sense.

We address this issue by introducing three semi-parametric extensions of probabilistic canonical correlation analysis and prove their identifiability. The semi-parametric assumption means that we do not assume any probability distribution of the latent factors, so the models have higher expressive power. The models differ by the type of data they can handle for each view: count data, continuous data, or a combination of the two.

We consider moment matching techniques for estimation in these models. For that, by drawing explicit links between the new models and a discrete version of independent component analysis (DICA), we first extend the DICA cumulant tensors to the new discrete version of CCA. By further using a close connection with independent component analysis, we introduce generalized covariance matrices, which can replace the cumulant tensors in the moment matching framework and therefore improve sample complexity and simplify derivations and algorithms significantly. As the tensor power method and orthogonal joint diagonalization are not applicable in the new setting, we use non-orthogonal joint diagonalization techniques for matching the cumulants.

We demonstrate the performance of the proposed models and estimation techniques in experiments with both synthetic and real datasets. In particular, we estimate linear transformations for each view on aligned text documents where each pair contains the same text in two different languages (English and French). The estimated linear transformations can be seen as topics for each view. We observe that the algorithmically derived topics represent the same concepts described in different languages. This work was previously published as [1].

[1] A. Podosinnikova, F. Bach, S. Lacoste-Julien. Beyond CCA: Moment Matching for Multi-View Models. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
Laura Morán-Fernández. A distributed feature selection method based on data complexity measures
Abstract: In the era of Big Data, 2.5 billion gigabytes of data are generated every single day on the Internet. The advent of this type of data, which is big not only in volume but also in complexity and variety, has posed a big challenge for machine learning researchers, since it is difficult to deal with a high number of features due to the curse of dimensionality. The scaling-up problem appears in any algorithm when data size increases beyond its ability to capture, curate, manage and process the data, damaging its performance and efficiency. To deal with high dimensionality, it is advisable to investigate the effects of applying feature selection. The use of an adequate feature selection method can lead to an improvement of the inductive learner, either in terms of learning speed, generalization capacity or simplicity. However, the advantages of feature selection come at a certain price, as the search for a relevant feature subset introduces an extra layer of complexity. This new layer increases the running time and memory requirements, making these algorithms very inefficient when applied to problems that involve very large datasets. Ironically, standard feature selection becomes impracticable on large datasets, which are the ones that would benefit most from its application.
Trying to overcome the drawbacks mentioned above, many distributed methods have been developed over the last years instead of the traditional centralized approaches. The first reason is that, with the advent of network technologies, data is increasingly available already distributed across multiple locations, and it is often not economical or legal to gather it in a single location. Second, when dealing with large amounts of data, most existing feature selection methods are not expected to scale well, and their efficiency may significantly deteriorate or they may even become inapplicable. Thus, a possible solution is to distribute the data into several subsets, run a feature selection method on each partition of the data, and then combine the results into a single feature subset. There are two main techniques for partitioning the data: horizontally (by samples) and vertically (by features).
In this work, we present a methodology in which several rounds of feature selection are performed on different partitions of the data, using both horizontal and vertical techniques. In contrast to existing procedures for combining the partial outputs obtained from each partition of the data, we propose a merging process that uses the theoretical complexity of these feature subsets [1]. Data complexity measures are a relatively recent proposal to identify data particularities which imply some difficulty for the classification task, beyond estimates of error rates. Experimental results on several datasets demonstrate that our proposal is able to significantly reduce the running time, showing a high average speed-up, while maintaining, or even improving, classification accuracy compared with applying the original feature selection algorithms to the whole datasets.
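A sketch of the horizontal-partitioning scheme with a simple majority-vote merge; the actual merge in [1] ranks candidate subsets by data complexity measures, and the filter metric, partition count and vote threshold here are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=600, n_features=30, n_informative=5,
                           random_state=0)

n_parts, k = 4, 8
# Horizontal partitioning: split the samples into disjoint subsets.
idx = np.array_split(np.random.default_rng(0).permutation(len(X)), n_parts)

votes = np.zeros(X.shape[1])
for part in idx:
    # Run a filter (mutual information) independently on each partition;
    # each partition votes for its top-k features.
    mi = mutual_info_classif(X[part], y[part], random_state=0)
    votes[np.argsort(mi)[-k:]] += 1

# Simple majority-vote merge of the partial results.
selected = np.flatnonzero(votes >= n_parts / 2)
```

Vertical partitioning works analogously by splitting the feature columns instead of the sample rows, with a final round to resolve cross-partition redundancy.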

[1] Morán-Fernández, L., Bolón-Canedo, V., & Alonso-Betanzos, A. (2016). Centralized vs. distributed feature selection methods based on data complexity measures. Knowledge-Based Systems.
Veronica Bolon Canedo. Beyond feature selection: an approach for reducing the cost of the features
Abstract: In an era of growing data complexity and volume and the advent of Big Data, feature selection (FS) has a key role to play in helping reduce high-dimensionality in machine learning problems. However, although new FS methods are constantly being developed, most focus on removing irrelevant and redundant features and only a few take into account the cost of obtaining input features.

There are some situations in which a user is interested not only in maximizing the merit of a subset of features, but also in reducing the costs that may be associated with features. For example, in medical diagnosis, symptoms observed with the naked eye are costless, but each diagnostic value extracted by a clinical test is associated with its own cost and risk. In other fields, such as image analysis, the computational expense of features refers to the time and space complexities of the feature acquisition process. This is a critical issue, specifically in real-time applications, in which the computational time required to deal with one feature or another is crucial, and also in the medical domain, where it is important to save economic costs and to improve the comfort of a patient by preventing risky or unpleasant clinical tests (factors that can also be treated as costs).

Trying to obtain a trade-off between a FS metric and the cost associated with the selected features, with the goal of selecting relevant features at low cost, we propose a new framework for cost-based FS. The framework consists of adding a new term to the evaluation function of any FS method so that it is possible to reach a trade-off between a filter metric (e.g. correlation or mutual information) and the cost associated with the selected features. A new parameter, called lambda, is introduced to adjust the influence of the cost on the evaluation function, allowing users fine control of the process according to their needs.
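A minimal sketch of the idea, using mutual information as the filter metric and a simple per-feature ranking in place of the subset-evaluation functions of CFS/mRMR/ReliefF; the feature costs here are random, purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
rng = np.random.default_rng(0)
cost = rng.uniform(0.0, 1.0, X.shape[1])   # acquisition cost per feature

relevance = mutual_info_classif(X, y, random_state=0)

def select(lam, k=4):
    """Select the top-k features by (filter merit - lambda * cost)."""
    score = relevance - lam * cost
    return list(np.argsort(score)[-k:])

pure = select(lam=0.0)    # cost ignored: plain filter selection
cheap = select(lam=1.0)   # cost penalised via the lambda term
```

Increasing lambda shifts the selection toward cheaper features at the price of some relevance, which is exactly the trade-off the framework exposes to the user.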

The framework was tested with three popular FS methods: CFS, mRMR and ReliefF. Experimentation was carried out over a broad suite of benchmark datasets, including challenging DNA microarray data. Results after performing classification with a support vector machine showed that the approach is sound and allows the user to reduce the cost without significantly compromising the classification error (according to the statistical tests). Moreover, we tried the framework on a real problem, tear film lipid layer classification, in which it is necessary to extract several features from the images. The extraction step involves the computation of several statistics, which is not a homogeneous operation, so the use of one feature or another makes a big difference in computational time. Using the whole set of extracted features is not feasible, since it takes more than 1 minute and prevents the use of the classification methodology in real time. After applying our framework, we were able to reduce the computational time to only 7ms, with classification accuracy over 92%.
Soukayna Mouatadid and Jan Adamowski. Forecasting short-term urban water demand using extreme learning machines
Abstract: In many countries around the world, the rising severity of climate change coupled with population growth and economic development are leading to an increase in urban water demand (Gleick 2011). To help ensure dependable water availability and minimize peak water use, reliable urban water demand forecast models must be developed. Such forecasting models allow water utilities to make operational and strategic decisions, thereby improving water security and water consumption sustainability (Beal and Stewart 2011). In this context, accurate short-term water demand forecasting is an important component of optimizing water supply systems and implementing effective water demand management programs (Ghiassi et al. 2008).

In this study, water demand for the city of Montreal was forecasted for one- and three-day lead times, using one linear forecasting method (MLR) and three non-linear, self-learning methods: Artificial Neural Network (ANN), Support Vector Regression (SVR) and Extreme Learning Machine (ELM) models. The aim of the study was to determine which method results in the most accurate forecasts. In order to do so, a model using ELM as a learning method was proposed and tested for one- and three-day lead times. The performance of the developed ELM model was then compared to that of the developed MLR, ANN and SVR models.

ELM has been introduced in forecasting applications in computational science, but remains uncommon in hydrological forecasting research (Ravinesh and Şahin 2015). The study explores, for the first time, the use of an ELM-based method to forecast short-term urban water demand. Models were based on different combinations of the main input variables (e.g., daily maximum temperature, daily total precipitation and daily water demand), for which data were available for Montreal, Canada between 1999 and 2010.

Based on the squared coefficient of determination, the root mean square error and an examination of the residuals, ELM models provided greater accuracy than MLR, ANN or SVR models in forecasting Montreal urban water demand for 1 day and 3 days ahead, and can be considered a promising method for short-term urban water demand forecasting. Possible future studies could investigate which of these data-driven models is suitable for forecasting long-term water demand values in other locations with different physical characteristics and different climates. Additionally, further research could determine whether there is a significant link between forecast accuracy and climate.
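
A minimal sketch of the standard ELM formulation assumed above: hidden-layer weights are drawn at random and the output weights are solved in closed form by least squares, with no iterative training. The architecture and toy data here are illustrative, not the Montreal models.

```python
import numpy as np

rng = np.random.default_rng(42)

def elm_fit(X, y, n_hidden=100):
    """Extreme learning machine: random hidden layer, output weights
    solved in closed form by least squares."""
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                        # random nonlinear features
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```

The absence of iterative weight updates is what makes ELM training fast compared with backpropagation-trained ANNs.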

Note: this work is published in the Urban Water Journal (October 2016) as a full research paper.

Beal, C. and Stewart, R.A., 2011. South East Queensland residential end use study. Urban Water Security Research Alliance, 47, 1836–5558.
Ghiassi, M., Zimbra, D. and Saidane, H., 2008. Urban water demand forecasting with a dynamic artificial neural network model. Journal of Water Resources Planning and Management, 134 (2), 138–146.
Gleick, P.H., 2011. The world’s water volume 7: The biennial report on freshwater resources. Washington, DC: Island Press.
Ravinesh, C.D. and Şahin, M., 2015. Application of the extreme learning machine algorithm for the prediction of monthly effective drought index in eastern Australia. Atmospheric Research, 153, 512–525.
Elena Erdmann and Kristian Kersting. Topic Classification by Human Machine-Learning Interaction
Abstract: When researching trends in news, researchers in communication studies are just beginning to understand the opportunities offered by new technologies. Instead of handpicking every article to analyze, it is now possible to access large text corpora, e.g., all articles that have ever appeared in the New York Times. When studying the discourse about a given topic, a simple keyword search can already retrieve most articles linked to the topic. In our cooperation with researchers from journalism studies, we found that they preferred topics generated by topic models such as LDA. What they specifically liked about these topics was the more comprehensive overview given by the presentation as a multinomial distribution over words. However, when investigating one specific topic, unsupervised methods do not necessarily provide the topic of interest.

We believe that the key to a trustworthy topic classification of documents lies in an entirely interpretable classification model. Our main objective is to find all documents related to a certain topic in a large corpus. In her mind, the user has a very detailed idea of what these documents look like. When she reads a document, it is easy for her to label it as belonging to the topic or not. However, the user cannot formalize the topic so that the computer can classify documents just as easily. We therefore propose an interactive approach to classifying documents. We build a classifier to label documents and ask the user for feedback on both the classification and the features used by the classifier.

In a way, we regard the topic in the user’s mind as a classification algorithm that we try to reverse-engineer. We start by training a classifier on a rough initial labeling of documents, such as one given by a keyword search. We then identify the most influential features for our classifier, present them to the user, and ask her to label each feature as indicative of, irrelevant to, or contradictory to the topic. We then use features rated differently by user and machine to actively sample the documents most likely to be misclassified by the algorithm. We ask the user to label these documents and retrain a new classifier on the user-labeled corpus. We repeat this process until the user is satisfied with the model of the topic as provided by the algorithm and the accuracy of the predictions.
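
One round of such a loop can be sketched as follows. A plain logistic regression on word counts stands in for the classifier, and the user's feature feedback is a simple dictionary; both are illustrative assumptions, not the system's actual components.

```python
import numpy as np

def train_lr(X, y, steps=300, lr=0.5):
    """Plain logistic regression on word counts (stand-in classifier)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w += lr * X.T @ (y - p) / len(y)
    return w

def interactive_round(X, y_rough, feature_feedback, top_k=3):
    """Train on rough labels, surface the most influential features, and
    flag documents dominated by features the user rated differently."""
    w = train_lr(X, y_rough)
    top = list(np.argsort(-np.abs(w))[:top_k])
    disputed = [j for j in top
                if feature_feedback.get(j, np.sign(w[j])) != np.sign(w[j])]
    doc_scores = X[:, disputed].sum(axis=1) if disputed else np.zeros(len(X))
    to_label = list(np.argsort(-doc_scores)[:top_k])  # docs to show the user
    return w, top, to_label
```

Documents heavy in disputed features are exactly those most likely to be misclassified, so labeling them gives the retraining step the most informative corrections.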

This work was partly supported by the DFG Collaborative Research Center SFB 876 project A6 and the Dortmund Center for Media Analysis (DoCMA).
Danielle Belgrave, Raquel Granell, John Guiver, Christopher Bishop and Iain Buchan. A Bayesian machine learning approach to latent variable modelling to accelerate endotype discovery
Abstract: The grand challenge of identifying personalised treatment and management strategies for medical conditions relies on the advancement of statistical learning methods for discovery of the subtypes of complex diseases (which may indicate disease "endotypes") by using 'intelligent phenotypes'. Bayesian machine learning methods can provide a flexible framework for endotype discovery through the application of latent variable modelling to disambiguate diseases where there are heterogeneous phenomena with comorbid symptoms. The Bayesian framework allows us to upscale these models in order to integrate high-dimensional longitudinal data from immune responses, genetic data and clinical data. Probabilistic programming provides a powerful tool to express such Bayesian machine-learning problems. We have applied this novel approach to using Bayesian latent trajectory models to disaggregate complex evolving endotypes to enable the discovery of clinically meaningful subgroups of allergic diseases. Our generalised framework for endotype discovery has been applied to understanding longitudinal developmental profiles of multiple, co-occurring symptoms of asthma and allergic diseases.
The term “atopic march” has been used to imply a natural progression of a cascade of symptoms from eczema to asthma and rhinitis through childhood. We propose that this hypothesised paradigm arose from cross-sectional analyses of longitudinal studies, and may reflect a population pattern that does not necessarily predominate at the individual level. In this study, we developed a Bayesian machine learning modelling framework with an expectation propagation algorithm for approximate inference to identify distinct latent classes based on individual changing disease profiles of eczema, wheeze, and rhinitis. We hypothesized that these different patterns represent the existence of distinct endotypes with distinct underlying pathophysiological mechanisms. All machine learning models were specified using Infer.NET. In order to capture disease heterogeneity and encapsulate possible different patterns of symptom progression in individual children over time, we compared the model evidence of different frameworks and prior assumptions. We first used two Markov chain probability models that assume that the probability of an event is conditioned on the disease state at the previous time-point. These models were assumed to be governed by a single multinomial latent variable representing probabilistic class assignment based on trajectories of symptom profiles over time. For each patient, we assumed a uniform Dirichlet prior distribution with equal probability assignment to each latent class. These models allowed us to test the assumption that transitions, or the sequence from one disease state to another, follow the “atopic march” profile. We then allowed a more flexible modelling framework which assumed that, other than random temporal fluctuation, each child's joint pattern of eczema, wheeze, and rhinitis can be explained by their belonging to a particular disease class profile.
This model places no constraints on (1) the structure of transitions within and between different symptoms over time or (2) the progression from one symptom to another. The conditional transition probabilities were evaluated and assumed to be independent. Using a graphical modelling framework, we then extended this model to include potential biomarkers which may allow more accurate endotype prediction.
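
A toy sketch of the latent-class Markov idea described above: each class has its own symptom transition matrix, the class prior is the mean of a symmetric Dirichlet, and a child's trajectory yields a posterior over classes. The two-class, single-symptom setup is a deliberate simplification of the three-symptom models in the study.

```python
import numpy as np

# toy: two latent classes, one binary symptom, class-specific transitions
T = np.array([[[0.9, 0.1],    # class 0: tends to remain symptom-free
               [0.3, 0.7]],
              [[0.4, 0.6],    # class 1: progresses to persistent symptoms
               [0.1, 0.9]]])
prior = np.array([0.5, 0.5])  # uniform class prior (mean of a symmetric Dirichlet)

def class_posterior(traj):
    """Posterior over latent classes given one child's symptom trajectory,
    with the event probability conditioned on the previous disease state."""
    lik = np.ones(2)
    for s, s_next in zip(traj[:-1], traj[1:]):
        lik *= T[:, s, s_next]
    post = lik * prior
    return post / post.sum()
```

A trajectory that starts symptom-free and then stays symptomatic is assigned mostly to the "progressing" class, which is the kind of probabilistic class assignment the latent variable encodes.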
Lucy Yin, Jennifer Andrews and Thomas Heaton. Bayesian Approach to Real-time Earthquake Detection
Abstract: Earthquake Early Warning (EEW) is a system that aims to mitigate seismic hazards and reduce casualties and economic losses. The system relies extensively on pattern recognition algorithms to detect earthquakes and estimate shaking intensity from real-time ground motion information, and then provides alerts before the onset of the destructive wave at specific facilities. A prototype of such a system is currently running and being tested in California, Japan, Mexico, and other seismically active regions.

One of the challenges of EEW is to provide reliable and rapid discrimination of earthquake signals from ambient noise (such as vehicle collisions, explosions, or glitches in the signal) and from teleseismic events (large earthquakes at a distance that cause no damage locally). Incorrect classification of the signals may cause false alarms that negatively impact the social acceptance of the system. Conventional methods of earthquake detection tend to focus on time-signal analysis of a few seconds of data, which causes a significant delay in alert delivery.

We propose a method that uses the three-component acceleration and velocity waveform data and Epidemic-Type Aftershock Sequence (ETAS) seismicity forecast information in parallel, and then produces a posterior prediction by combining the predictions from these heterogeneous sources in a Bayesian probabilistic approach. The waveform information is fed into a logistic regression model that provides the likelihood of the signal observation; the ETAS forecast provides prior information on the spatio-temporal earthquake occurrence probability. We collected 2446 three-component strong-motion records for training and testing. The rapid prediction is available as soon as 0.5 s after the trigger at a single station. The prediction precision of the proposed method reaches 98% at the first prediction, and the classification accuracy increases with time. Leave-one-out validation also demonstrates robust performance for future earthquake signal detections. We compared the method with a conventional earthquake detection method, the Tc-Pd EEW classification criterion, and found that our prediction is 83% faster with a 5% higher precision rate. Our new strategy has shown promising results, and implementing this methodology could provide significantly faster and more reliable EEW warnings to regions near the earthquake’s epicenter, where the strongest shaking is observed.
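
The Bayesian combination can be sketched as follows. Here we assume the logistic regression score was obtained under a balanced (50/50) training prior, so Bayes' rule reweights it by the ETAS-style prior; this is a minimal illustration, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def waveform_score(features, w):
    # logistic-regression score on waveform features (stand-in weights w)
    return sigmoid(features @ w)

def eq_posterior(p_waveform, prior_eq):
    """Reweight a balanced-prior classifier score P(EQ | waveform) by an
    ETAS-style prior probability of local earthquake occurrence."""
    num = p_waveform * prior_eq
    return num / (num + (1.0 - p_waveform) * (1.0 - prior_eq))
```

With a neutral prior of 0.5 the classifier score passes through unchanged; an elevated ETAS rate (e.g., during an aftershock sequence) pushes the posterior toward declaring an earthquake, and a quiet period pulls it down.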
Alexandra Gessner, Michael Osborne, Roman Garnett and Philipp Hennig. Fast Bayesian Quadrature with exact DPP sampling
Abstract: *Bayesian quadrature (BQ)* is a rigorous framework for the construction of posterior measures over integrals $F = \int f(x)\,dx$---both for the active selection of evaluation nodes $[x_i]$, and inference from them, using Gaussian process (GP) models. BQ is arguably the most promising avenue toward a solution, in the long term, of the foundational challenge of high-dimensional integration at the heart of (not just) probabilistic programming. Among the key problems on this path is the currently high computational cost associated with choosing evaluation nodes. (The other key problem is the design of the GP measure, which we do not address here.)

Recent theoretical and practical advances motivate the use of determinantal point processes (DPPs) for this node selection. A DPP is a stochastic point process defined relative to a positive definite kernel, such that the probability of a point set $X$ is proportional to the determinant of the kernel Gram matrix $k_{XX}$. Hence, there is one DPP for every GP, and vice versa; an appealing property for the BQ setting. On the theoretical side, Bardenet & Hardy [^1] showed that Monte Carlo estimators (the base limit of BQ) arising from DPP samples converge to the true integral at a strictly faster rate than the classic MC estimate from i.i.d. random nodes. On the practical side, Gunter et al. [^2] presented WSABI, a lightweight algorithm for BQ on probability measures that outperforms MCMC methods (in runtime) on problems of low dimensionality. WSABI actively adapts its evaluation strategy to the integrand's shape. It defines a transformation of a latent GP model onto a measure with strictly positive support; approximate inference on this measure yields an evaluation utility that is the product of the posterior variance and the square of the posterior mean of a GP. This means the WSABI evaluation strategy is related to sampling from an Immanantal Point Process (IPP), a generalization of DPPs---a connection not used in [^2].

We show that it is feasible to draw exact samples from the WSABI utility function in computational cost at most cubic in the number of samples drawn, and linear when using popular approximation schemes for GP regression. Doing so drastically reduces the computational cost of the WSABI algorithm, while simultaneously improving its sample efficiency. Using our sampling algorithm within WSABI widens the gap between that method and classic MCMC methods in runtime on low- to medium-dimensional problems. Our work also highlights the exciting fact that WSABI uses what essentially amounts to an IPP sampling scheme. IPPs have sometimes been mentioned as a way to address the exploration-exploitation problem in integration, but the idea has not yet been used in practice. Our work is the first to do so.
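
For reference, the basic BQ machinery underlying all of the above can be sketched in a few lines: with an RBF kernel and a standard normal integration measure, the kernel mean embedding is available in closed form, and the posterior mean of the integral is a weighted sum of function evaluations. This is vanilla BQ with fixed nodes, not WSABI or the DPP-based node selection proposed here.

```python
import numpy as np

ell = 1.0                              # RBF kernel lengthscale
nodes = np.linspace(-4, 4, 15)         # evaluation nodes x_i (fixed grid)

def k(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

# closed-form kernel mean z_i = \int k(x, x_i) N(x; 0, 1) dx for the RBF kernel
z = np.sqrt(ell ** 2 / (ell ** 2 + 1)) * np.exp(-nodes ** 2 / (2 * (ell ** 2 + 1)))

def bq_mean(f):
    """GP posterior mean of F = \int f(x) N(x; 0, 1) dx given f at the nodes."""
    K = k(nodes, nodes) + 1e-8 * np.eye(len(nodes))   # jitter for conditioning
    return z @ np.linalg.solve(K, f(nodes))
```

The active-selection question addressed in this work is precisely which `nodes` to use; the fixed grid above is the naive baseline.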

[^1]: R. Bardenet & A. Hardy. *Monte Carlo with Determinantal Point Processes*. arXiv preprint 1605.00361v1, 2016

[^2]: T. Gunter, M.A. Osborne, R. Garnett, P. Hennig & S. Roberts. *Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature*. NIPS 2014
Lidia Contreras-Ochando and Cesar Ferri. Wind-sensitive interpolation of urban air pollution forecasts
Abstract: Every person living in an urban area is exposed to outdoor air pollution. Air pollution has been linked to numerous premature and prenatal deaths each year, and some works reckon that vehicle emissions produce over 90% of the air pollution in cities. Although many cities provide air quality data in a public and open way, in many cases this information is presented with significant delays because of the need to validate the data (three hours in the city of Valencia in Spain, our case study). This delay can be a problem, since high pollution levels are not detected in real time. Besides this, the information is limited to the areas where the measurement stations are located, and in most cases the number of sensors is quite limited due to the high cost of the equipment and its maintenance.

Considering these restrictions, in this work we address the problem of forecasting pollution levels in real time. We present results on predicting and interpolating urban air pollution forecasts for the city of Valencia based on historical data about pollution levels, traffic intensity and meteorological conditions. First, we compare several regression models able to predict the levels of four different pollutants (NO, NO2, SO2, O3) at the six measurement stations of Valencia. Wind is a key factor in the propagation of pollutants around the city. For that reason, we study different techniques to incorporate this factor into the regression models; specifically, we study how traffic intensity affects pollution levels depending on the wind. Combined with Random Forest models, these techniques produce the best pollutant predictions. Second, we analyse how to interpolate forecasts across the whole city. For this, we propose a new interpolation method that takes wind direction into account, and we compare this proposal with well-known interpolation methods such as Inverse Distance Weighting (IDW) and Kriging. We consider this a dynamic method, since it is sensitive to the context of the interpolation point. Our experiments using cross-validation show that the new method improves the results for some of the pollutants, especially when more data is used in the interpolation. Finally, using these pollution predictions and interpolation methods, we generate real-time air pollution maps of the city of Valencia, which we have made available on a public website.
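
The flavor of a wind-sensitive interpolation can be sketched by modifying plain IDW so that stations upwind of the target point receive extra weight. The exponential up-weighting used here is a hypothetical variant for illustration, not the exact scheme of the paper.

```python
import numpy as np

def idw(stations, values, target, p=2, wind_dir=None, beta=1.0):
    """Inverse distance weighting; if a unit wind direction is given,
    stations upwind of the target are up-weighted (illustrative variant)."""
    d = np.maximum(np.linalg.norm(stations - target, axis=1), 1e-9)
    w = 1.0 / d ** p
    if wind_dir is not None:
        toward = (target - stations) / d[:, None]   # unit vectors station -> target
        w = w * np.exp(beta * toward @ wind_dir)    # boost upwind stations
    return float(w @ values / w.sum())
```

With no wind information this reduces to standard IDW; with wind blowing from station A toward the target, the estimate shifts toward A's reading, reflecting that A's pollution is being transported there.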

Note: This work has been presented at ICCS 2016 as a full research paper. Full text is available at http://www.sciencedirect.com/science/article/pii/S187705091630758X
Míriam Bellver, Xavier Giró I Nieto and Ferran Marqués. Efficient search of objects in images using deep reinforcement learning
Abstract: When we humans look at an image, we perform a sequential extraction of information in order to understand its content. First, we fix our gaze on the most salient part of the image, and the information extracted there guides our gaze towards other points of the image, until we have analyzed all of its relevant information. This is our natural behaviour for gathering information from our surroundings. Traditionally in computer vision, images have been analysed at the local scale with a sliding-window scan, often at several scales. This approach analyses the different parts of the image independently, without exploiting the correlation among them. By introducing a hierarchical partition of the image, we can more easily exploit the correlation between regions through a top-down scan that first takes a global view of the image and then sequentially focuses on the local parts that contain the relevant information. Moreover, if we train a deep architecture that does not reward regions observed independently, as traditional object proposals do, but instead rewards successful long-term searches connecting the different regions observed, we can achieve a sequential detection of objects, which has been shown to be richer in information than using simple independent fixations.

The goal of this ongoing research is to perform efficient detection of objects in images. To be efficient, the key idea is to focus on those parts of the image that contain richer information and zoom in on them, guiding a hierarchical search for objects. An intelligent agent capable of deciding where to focus attention in the image is trained using deep reinforcement learning techniques. This RL agent first looks at the whole image and decides which of the regions of a quadtree partition, plus a central region, is the most promising for finding objects of a given category. The agent is trained using deep Q-learning, with an architecture similar to the one used by DeepMind [1].

The input of the network is a convolutional descriptor of the region observed at the current step and a history vector that describes the previous steps of the search. This idea was also used in [2]. The main difference from that approach is that we use a fixed hierarchical partition to guide our sequential search. Furthermore, to be efficient, sharing convolutional features is a key aspect of our system's pipeline. Convolutional features from VGG-16 [3] are extracted from the initial full-resolution picture, and the descriptors for each subpartition are then cropped from this feature map.
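
The hierarchical descent itself is simple to sketch: at each step the agent picks, among the four quadtree children plus a central region, the one its Q-function scores highest, and zooms in. Here a toy `q_value` stands in for the trained deep Q-network.

```python
def children(box):
    """Quadtree partition of a region plus a central region."""
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    qx, qy = (x1 - x0) / 4, (y1 - y0) / 4
    return [(x0, y0, mx, my), (mx, y0, x1, my),
            (x0, my, mx, y1), (mx, my, x1, y1),
            (x0 + qx, y0 + qy, x1 - qx, y1 - qy)]

def greedy_search(q_value, box, depth):
    """Top-down search: at each step zoom in on the child region the
    (learned) Q-function scores highest."""
    path = [box]
    for _ in range(depth):
        box = max(children(box), key=q_value)
        path.append(box)
    return path
```

In the full system the Q-values come from the shared VGG-16 feature map plus the history vector, so each zoom reuses computation instead of re-evaluating the image from scratch.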

[1] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

[2] Caicedo, J. C., & Lazebnik, S. (2015). Active object localization with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2488-2496).

[3] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representation 2015.
Beyza Ermis and Taylan Cemgil. Differentially Private MCMC Algorithms for Distributed Matrix Factorization
Abstract: Differential Privacy (DP) is currently the most widely accepted privacy definition. DP is designed to provide a strong form of privacy protection for individual users’ information while still allowing useful inference from the data. In the DP scenario, several individuals provide their data to the curator/aggregator, whose job is to combine the local results. The curator subsequently either publishes some aggregate information or iteratively answers queries by adversaries in a privacy-preserving way. Concisely, DP ensures that an adversary should not be able to reliably infer whether or not a particular individual is participating in a database, even with unlimited computational power and access to every entry except for that particular individual’s data.

We investigate a related but different scenario, where each data site concurrently acts both as a data provider and as a data consumer. In our scenario, N distinct parties (say, companies) hold sensitive data from several individuals, and they wish to jointly estimate the parameters of a statistical model conditioned on all available data in a way that respects each party’s privacy while still protecting the privacy of each individual’s data instance. The parties collaborate to compute some function of the entire data, yet any party in the system can learn very little additional information beyond the resulting functions and its own data. We develop an algorithmic framework that allows the participants to build a relationship of mutual benefit within a privacy-preserving mechanism of diverse data sources and prediction needs.

We shall assume that data instances are arranged in matrix form, so that each data site contains a subset of the matrix entries. A natural statistical model for such relational data is the collective matrix factorization (CMF) model. The key idea in CMF is that latent structures are shared across the matrices, which enables transferring information between them. Inspired by this idea, we introduce a privacy-aware collective Bayesian matrix factorization approach under conditions where sharing of individual records across data sites is not permitted between parties, but local data is fully accessible to its owner.

We describe a provably differentially private treatment of the proposed CMF models for two notions of privacy: i) global-DP and ii) local-DP. To provide global-DP, we use a novel connection between DP and samples from the posterior obtained via Stochastic Gradient Langevin Dynamics (SGLD). We then propose two different CMF models that provide local-DP, which is much stronger than global-DP, with the use of Markov chain Monte Carlo (MCMC) methods. We test our algorithms on distributed, collective matrix factorization problems and their applications to recommender systems. With these experiments, we first show that the impact of the noise added to the original data can be reduced by sharing the results of differentially private computations at different sites. Then, we demonstrate empirically that as the number of sites participating in the framework increases, we can support \epsilon-differentially private estimates while still preserving good utility and achieving better prediction performance than learning at a single site.
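
The SGLD building block mentioned above can be sketched in isolation: each update takes half a gradient step on the log-posterior and injects Gaussian noise of variance equal to the step size, so the iterates approximately sample the posterior. This is only the sampling primitive; an actual DP guarantee additionally requires gradient clipping and calibrated noise scales, which are omitted here.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgld_step(theta, grad_log_post, step):
    """One Langevin update: half a gradient step plus injected Gaussian
    noise of variance `step`, so iterates approximately sample the
    posterior rather than converge to a point estimate."""
    noise = rng.normal(scale=np.sqrt(step), size=np.shape(theta))
    return theta + 0.5 * step * grad_log_post(theta) + noise
```

Run on a toy Gaussian posterior, the chain's long-run average recovers the posterior mean, which is the property the privacy-via-posterior-sampling argument builds on.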
Claire Vernade, Paul Lagrée and Olivier Cappé. Multiple-Play Bandits in the Position-Based Model
Abstract: Online advertising requires placing multiple items at different positions on the same webpage. This task can be cast in the multiple-play semi-bandit setting. However, a major concern is that the system cannot always decide whether the user feedback for every item is exploitable. Indeed, some of the content may be ignored by the user, making that feedback unexploitable. The present work proposes to exploit available information regarding the display position bias under the so-called Position-based click model (PBM).

The feedback of the system is a sequence of ones (clicks) and zeros. In order to observe a click, two conditions must be met: the user must reach the position of the ad *and* like the advertised content. Consequently, observing a zero is ambiguous: it may result either from a lack of attention or from indifference to the advertised content. The system is unable to disambiguate these situations, which biases the estimation of the click-through rate (CTR). Although the parameters corresponding to the position effect are related to the page layout and are assumed to be known in advance, the click-through rate of each item remains to be learned in a sequential fashion, minimizing an ad hoc regret metric.

In several recent works on multiple-play bandits, the Cascade Model addresses a simpler version of the suggested problem, because the learner knows whether each position has been seen or not. On the other hand, the Dependent Click Model [1] considers the position-based model in the context of top-down list evaluation with at most one click, which only leaves ambiguity when no click is observed.

We provide a novel asymptotic regret lower bound whose proof relies on new techniques from [2], which we have tried to make as generalizable as possible to other feedback models. This bound allows us to evaluate how much the complexity of the problem is affected by the censoring effect of the PBM. We propose optimistic policies to solve the given problem: PBM-KLUCB builds on ideas from [3], and an approximate version of Thompson Sampling from [4] is suggested, as well as its matching exact version. A finite-time analysis of PBM-KLUCB is provided, which asymptotically matches the proven lower bound, hence validating its optimality. We thoroughly evaluate all proposed policies on synthetic and real data and compare them to a state-of-the-art algorithm for learning to rank.
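
The PBM setting can be simulated in a few lines: a click at slot l on item k occurs with probability kappa[l] * theta[k], with kappa known and theta to be learned. The sketch below runs an approximate Thompson sampling policy in which unclicked displays are discounted by kappa, a simple heuristic for the censored zeros rather than the exact update analysed in the paper.

```python
import numpy as np

rng = np.random.default_rng(7)

kappa = np.array([1.0, 0.6, 0.3])        # known position examination probabilities
theta = np.array([0.8, 0.5, 0.4, 0.2])   # unknown item click probabilities
a, b = np.ones(4), np.ones(4)            # Beta posterior parameters per item

def play_round():
    """Sample CTRs from the Beta posteriors, display the top items, then
    update; a zero is ambiguous (unexamined vs. disliked), so unclicked
    displays are only charged a kappa-weighted failure."""
    samples = rng.beta(a, b)
    ranking = np.argsort(-samples)[:len(kappa)]
    clicks = rng.random(len(kappa)) < kappa * theta[ranking]
    a[ranking] += clicks
    b[ranking] += (1 - clicks) * kappa   # discount the censored zeros
    return ranking

for _ in range(3000):
    play_round()
```

After enough rounds the best item dominates the top slot, and its posterior mean approaches its true CTR because that slot's examination probability is 1.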

[1] S. Katariya, B. Kveton, C. Szepesvári, & Z. Wen. DCM bandits: Learning to rank with multiple clicks. In Proc. of the International Conference on Machine Learning, 2016.
[2] E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 2015.
[3] R. Combes, M. S. T. M. Shahi, A. Proutière, et al. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, 2015.
[4] J. Komiyama, J. Honda, & H. Nakagawa. Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays. In Proc. of the International Conference on Machine Learning, 2015.
Cynthia Rush and Ramji Venkataramanan. Finite Sample Analysis of Approximate Message Passing
Abstract: In this talk I consider the setting of high-dimensional linear regression, where the goal is to estimate a high-dimensional vector B_0 from a noisy measurement y = AB_0 + w. For example, when B_0 is sparse and the noise w is Gaussian, this model represents the compressed sensing problem. Approximate Message Passing (AMP) is a low-complexity, scalable algorithm, derived as an approximation to loopy belief propagation on dense factor graphs (like the one representing the high-dimensional regression task), that has been studied extensively for this problem.

In this work, we analyze the performance of Approximate Message Passing in the regime where the problem dimension is large but finite. AMP works by iteratively updating estimates of the unknown vector B_0. Under suitable assumptions on the measurement matrix A, AMP has the attractive feature that its performance can be accurately characterized in the asymptotic large system limit by a simple scalar iteration called state evolution. For example, it can be shown that the mean square error of the estimate at any step of the algorithm is predicted by state evolution.

Previous proofs of the validity of state evolution have all been asymptotic convergence results. In this talk, we derive a concentration result for AMP with i.i.d. Gaussian measurement matrices of finite dimension n by N. The result shows that the probability of deviation from the state evolution prediction falls exponentially in n (e.g., the mean square error of the estimate at any step of the algorithm concentrates on the value predicted by state evolution). Our result provides theoretical support for empirical findings that have demonstrated excellent agreement of AMP performance with state evolution predictions for moderately large dimensions.
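
A standard instantiation of the AMP iteration for sparse B_0 uses soft thresholding as the denoiser and adds the Onsager correction term to the residual; the threshold policy below (a multiple of the residual RMS) is a common heuristic choice, one of several possibilities.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft(x, t):
    # soft-thresholding denoiser eta(x; t)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def amp(y, A, iters=30, alpha=2.0):
    """AMP for sparse regression: denoise the effective observation
    x + A^T z, then recompute the residual with the Onsager term."""
    n, N = A.shape
    x, z = np.zeros(N), y.copy()
    for _ in range(iters):
        tau = alpha * np.sqrt(np.mean(z ** 2))          # threshold policy
        x = soft(x + A.T @ z, tau)
        z = y - A @ x + (z / n) * np.count_nonzero(x)   # Onsager correction
    return x
```

The Onsager term, proportional to the number of active coordinates, is what makes the effective observation behave like the truth plus Gaussian noise and hence makes state evolution track the algorithm's mean square error.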

Note: An earlier version of this work was presented at the International Symposium on Information Theory, July 2016.
Maria-Florina Balcan, Ellen Vitercik and Colin White. Learning Combinatorial Functions from Pairwise Comparisons
Abstract: A large body of work in machine learning has focused on the problem of learning a close approximation to an underlying combinatorial function, given a small set of labeled examples. However, for real-valued functions, cardinal labels might not be accessible, or it may be difficult for an expert to consistently assign real-valued labels over the entire set of examples. For instance, it is notoriously hard for consumers to reliably assign values to bundles of merchandise. Instead, it might be much easier for a consumer to report which of two bundles she likes better.

With this motivation in mind, we consider an alternative learning model, wherein the algorithm must learn the underlying function up to pairwise comparisons, from pairwise comparisons. In this model, we present a series of novel algorithms that learn over a diverse variety of widely applicable combinatorial function classes. These range from graph functions to broad classes of valuation functions that are fundamentally important in economics, the analysis of social networks, and machine learning, such as coverage, submodular, XOS, and subadditive functions, as well as functions with sparse Fourier support.
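
The comparison-based setting is easy to simulate for the special case of a linear utility: a perceptron-style learner that sees only which of two bundles is preferred recovers a function agreeing with the truth on most comparisons. This linear toy is far simpler than the combinatorial classes treated in the paper; it is only meant to illustrate learning from pairwise comparisons.

```python
import numpy as np

rng = np.random.default_rng(5)

d = 5
w_true = rng.normal(size=d)

def comparisons(n):
    """Simulated consumer: reports which of two bundles she prefers."""
    X1, X2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    return X1, X2, np.sign((X1 - X2) @ w_true)

def fit(X1, X2, prefs):
    """Perceptron-style learning of a utility from pairwise comparisons
    only; the learned scores are meaningful only up to the ordering they
    induce, never as cardinal values."""
    w = np.zeros(d)
    for x1, x2, p in zip(X1, X2, prefs):
        if np.sign((x1 - x2) @ w) != p:   # mistaken comparison -> update
            w += p * (x1 - x2)
    return w
```

Note that no cardinal label is ever observed, which mirrors the motivation above: consumers can rank bundles far more reliably than they can price them.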

Note: This work was presented at COLT 2016 as a full research paper.
Leila Wehbe, Anwar Nunez-Elizalde, Alex Huth, Fatma Imamoglu, Natalia Bilenko and Jack Gallant. Deep multi-view representation learning of brain responses to natural stimuli
Abstract: One of the central goals of cognitive neuroscience is to understand how information is represented in the brain. Using functional neuroimaging, cognitive neuroscience researchers can study how stimulus features are mapped across the cerebral cortex. Most neuroimaging studies focus on only a single cognitive domain such as vision and language, and use relatively simple, highly controlled stimuli. Here we present a statistical model that learns a common latent “cognitive” space that accounts for results obtained in many different naturalistic neuroimaging experiments, such as watching movies or listening to stories.

Here we cast this problem as a multi-view modeling problem. In this framework, the perception of a stimulus of any type, such as a movie or a story, is an event that can be captured from multiple views. We consider two types of views: (1) views corresponding to the brain activity elicited in each one of N subjects who are exposed to that stimulus, and (2) views that represent the properties of that same stimulus in different feature spaces. Some feature spaces, such as semantic features, can be used to represent multiple types of stimuli while others, such as visual features, can be used to represent only one type.

To model these data we designed a multiview autoencoder with a shared bottleneck layer. The bottleneck layer is a low dimensional representation of the latent cognitive event, estimated from one or more input views. Our model aims to reconstruct any one view from the bottleneck layer estimate alone. Ideally, this model should allow us to predict (1) any one subject's brain activity from some stimulus features, (2) some stimulus features from one subject's brain activity, and (3) one subject's brain activity from any other subject's brain activity. Our model assimilates many common methods that are currently used to study brain representations.
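
As a minimal sketch of the shared-bottleneck idea, the following toy example (linear maps and synthetic data in place of the deep multi-view autoencoder) trains an encoder from one view and decoders for two views through a common low-dimensional code:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, d1, d2 = 100, 3, 8, 6          # samples, bottleneck size, view dims

# Two views generated from a shared latent "event" Z: view 1 plays the
# role of one subject's brain activity, view 2 of stimulus features.
Z = rng.normal(size=(n, k))
X1 = Z @ rng.normal(size=(k, d1))
X2 = Z @ rng.normal(size=(k, d2))

E = 0.1 * rng.normal(size=(d1, k))   # encoder: view 1 -> bottleneck
D1 = 0.1 * rng.normal(size=(k, d1))  # decoder: bottleneck -> view 1
D2 = 0.1 * rng.normal(size=(k, d2))  # decoder: bottleneck -> view 2

def loss():
    C = X1 @ E                       # shared bottleneck code
    return np.sum((X1 - C @ D1) ** 2) + np.sum((X2 - C @ D2) ** 2)

loss0 = loss()
lr = 1e-4
for _ in range(800):
    C = X1 @ E
    R1, R2 = X1 - C @ D1, X2 - C @ D2      # reconstruction residuals
    E -= lr * (-2 * X1.T @ (R1 @ D1.T + R2 @ D2.T))
    D1 -= lr * (-2 * C.T @ R1)
    D2 -= lr * (-2 * C.T @ R2)
loss1 = loss()
```

Training reduces the joint reconstruction error, so the code learned from view 1 alone also predicts view 2, which is the cross-view prediction property described above.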

We tested the model using fMRI data recorded from four subjects in two different experiments. In the first [1], the subjects watched 2 hours of natural movies. The movies were annotated with a semantic and a visual feature space. In the second experiment [2], the subjects listened to 2 hours of spoken stories. The story words were annotated with a semantic and a phonetic feature space. Our results verify that the model can predict each subject's brain activity from stimulus features, it can predict stimulus features from brain activity, and it can predict any subject's brain activity from other subjects' data.

Another important feature of the model is that it recovers an explicit representation of the underlying common latent “cognitive” space. For example, some nodes of the bottleneck layer correspond to clusters of related words and concepts, while others reflect common brain regions that represent these same concepts.

[1] S. Nishimoto, A.T. Vu, T. Naselaris, Y. Benjamini, B. Yu, and J.L. Gallant. Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology, 2011.

[2] A.G. Huth, W.A. de Heer, T.L. Griffiths, F.E. Theunissen, and J.L. Gallant. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 2016.
Yi Zhao and Xi Luo. Pathway Lasso: Estimate and Select Mediation Pathways with High Dimensional Mediators
Abstract: Causal mediation analysis is widely applied in the social, economic and biological sciences to assess the effect of a treatment or exposure on an outcome of interest passing through intermediate variables (mediators). It has become increasingly popular to study the decomposition of the total treatment effect on the outcome through multiple mediation pathways. We motivate our method by studying pathways between brain regions after stimuli, using task-related functional magnetic resonance imaging (fMRI). This setting motivates methods that address two new challenges: multiple fMRI mediators with no temporal ordering, and correlated errors between mediators.

Structural equation modeling (SEM) is a popular technique for estimating mediation pathway effects. However, the estimates become unstable in the high-dimensional mediator setting, as all the mediators enter as predictors in regression-type models [1]. Regularization is a general approach to high-dimensional problems, but existing approaches do not address regularizing the effects of mediation pathways [2-3], which are commonly represented as products of two parameters. Indeed, the product of two parameters is a non-convex function that is not handled by existing convex regularization methods, including the various Lasso-type penalties. In this paper, we propose a sparse mediation model using a regularized SEM approach, where sparsity means that a small number of mediators have nonzero mediation effects between a treatment and an outcome, and introduce a new convex penalty, named Pathway Lasso, to directly regularize the pathway effects.
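
For a single mediator, the structural equations and the product-form pathway effect that such a penalty must handle can be written as (standard mediation notation; the paper's model allows many mediators with correlated errors):

```latex
M = a\,T + \epsilon_1, \qquad Y = c\,T + b\,M + \epsilon_2 ,
```

so the total effect of the treatment T on the outcome Y decomposes into a direct effect c and a mediated pathway effect ab; with p mediators the mediated part becomes \sum_{j=1}^{p} a_j b_j, and each product a_j b_j is exactly the non-convex quantity that the Pathway Lasso penalty regularizes directly.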

Our contributions in this paper are as follows: (1) this is among the first attempts to model high dimensional mediation pathways jointly; (2) we propose a general multiple mediator model with causal interpretation when the temporal order of the mediators is unknown; (3) we introduce a novel convex penalty for the non-convex function of the product, and this penalty enables simultaneous pathway estimation and selection; (4) we develop an ADMM-type algorithm for the proposed method, and study its solutions in closed form; (5) we demonstrate the robustness and advantages of our method via simulations and a publicly available fMRI dataset.

[1] Wang et al., Stat Med, 2013. [2] Tibshirani, JRSSB, 1996. [3] Shojaie and Michailidis, Biometrika, 2010.
Yijun Zhao, Bilal Ahmed, Thomas Thesen, Karen Blackmon, Jennifer Dy and Carla E. Brodley. A Non-parametric Approach to Detect Epileptogenic Lesions using Restricted Boltzmann Machines
Abstract: Epilepsy is a common neurological disorder, affecting approximately 1% of the population. It is characterized by profound abnormal neural activity during seizures and interictal (between-seizure) periods. Uncontrolled epilepsy can have harmful effects on the brain and carries an increased risk of injury and sudden death. About one third of epilepsy patients remain resistant to medical treatment. Our research addresses the identification of lesions in the MRIs of patients with focal cortical dysplasia (FCD), which is recognized as the most common source of pediatric epilepsy and the third most common source in adults with medically intractable seizures. Early detection and subsequent surgical removal of the FCD lesion area is the most effective treatment, and is often the last hope for these patients.

The most widely used technology for identifying epileptic lesions is MRI coupled with intracranial EEG (iEEG). For MRI-positive patients (i.e., patients with visible abnormal areas in the MRI), the placement of electrodes on the cortex is informed by the problematic regions pinpointed by visual inspection of the MRI. However, for MRI-negative patients there is no visible lesion to guide precise electrode implantation. The final target for surgical resection is based on both of these findings, where available, coupled with other clinical data. Consequently, the post-surgical success (i.e., seizure freedom after surgery) ratio of MRI-positive to MRI-negative patients is 66% to 29%. Unfortunately, 45% of FCD patients are MRI-negative. For this reason, the surgical resection procedure remains highly underutilized, as most practitioners are unwilling to operate in the absence of a visually detected lesion.

In this paper, we propose a non-parametric approach to detecting MRI-negative epilepsy lesions using restricted Boltzmann machines (RBMs). In particular, we transform 3D MRI images of human brains into a standard 2D surface using the Surface-Based Morphometry methodology and extract features that characterize human cortical surfaces. Our model addresses both the limited number of available features and the inter-patient variability in the input data. For the former, we used an RBM as a pre-training step. For the latter, we applied a Dirichlet-process-based clustering algorithm and estimated its parameters via variational inference. To accomplish our classification task, we obtain multiple classifiers by training an augmented RBM for each non-empty component from the clustering algorithm, and take a majority vote among all classifiers when making a prediction. We evaluated our model on brain images of twelve MRI-negative patients. Our model correctly detected abnormal regions within the resected areas in 58% of the patients, with 99.9% accuracy in correctly classifying the non-lesional vertices. Although the detection rate (58%) may seem low, the results are significant since a board of experienced neuroradiologists failed to locate any lesion in all of these patients.
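
The final ensemble step can be sketched generically; the predictions below are hypothetical values, not outputs of the trained RBMs:

```python
import numpy as np

# One row of 0/1 predictions per trained classifier (in the paper each
# row would come from an augmented RBM trained on one cluster
# component), one column per cortical vertex to label.
votes = np.array([
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
])

# Majority vote across the ensemble: label 1 (lesional) if at least
# half of the classifiers agree.
majority = (votes.sum(axis=0) >= (len(votes) + 1) / 2).astype(int)
```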

Note: This paper was presented at the KDD 2016 plenary session.
Shao-Yuan Li, Yuan Jiang and Zhi-Hua Zhou. Partial Multi-View Clustering
Abstract: Real data often have multiple modalities or come from multiple channels, and multi-view clustering provides a natural formulation for generating clusters from such data. Previous studies assumed that each example appears in all views, or at least that there is one view containing all examples. In real tasks, however, it is often the case that every view suffers from some missing data, which results in many partial examples, i.e., examples with some views missing. In this paper, we present possibly the first study on partial multi-view clustering. Our proposed approach, PVC, works by establishing a latent subspace where the instances corresponding to the same example in different views are close to each other, and the instances (belonging to different examples) in the same view gather smoothly. Experiments demonstrate the advantages of our proposed approach.
Anusua Trivedi. Re-Usability of Deep Learning Models - Transfer Learning and Fine-tuning across different domains
Abstract: INTRODUCTION: In this talk, we propose prediction techniques using deep learning on different types of image datasets – medical images and fashion images. We show how to build a generic deep learning model that can be used with:
• A fluorescein angiographic eye image to predict Diabetic Retinopathy
• A fashion image to predict the clothing type in that image
We propose a method to apply a pre-trained deep convolutional neural network (DCNN) to images to improve prediction accuracy. We use an ImageNet pre-trained DCNN and apply fine-tuning to transfer the learned features to the prediction task. We use this fine-tuned model on two very different domain-specific datasets. Our approach improves prediction accuracy on both domain-specific datasets, compared to state-of-the-art machine learning approaches.

MOTIVATION: Deep learning is an emerging field of research with applications across multiple domains. We show how a transfer learning and fine-tuning strategy leads to re-usability of the same convolutional neural network model in disjoint domains. This work discusses the re-usability of deep learning models across various domains, thus improving development, training and deployment time in real-world scenarios. It applies to researchers from different fields, making this a generic talk from which all attendees can benefit.

1. Motivation behind using Deep Learning models for Images:
Much work has been done on developing state-of-the-art machine learning algorithms and morphological image processing techniques that explicitly extract features prevalent in images. The generic workflow used in a standard image classification technique is as follows:
• Image preprocessing techniques for noise removal and contrast enhancement.
• Feature extraction technique
• Classification/Prediction
However, these explicit feature extraction processes are very time- and effort-consuming. Thus, we choose to automate the image processing and feature extraction steps by using DCNNs.
2. Transfer Learning & Fine-tuning DCNNs:
Recent research has demonstrated that DCNNs are very effective at automatically analyzing large collections of images and identifying features that can categorize images with minimal error. Since modern DCNNs take 2-3 weeks to train across GPUs, the Berkeley Vision and Learning Center (BVLC) has released some final DCNN checkpoints. In this work, we use an ImageNet pre-trained DCNN - GoogLeNet.
3. Deep Learning models for Image Classification:
We fine-tune the pre-trained generic DCNN to recognize fluorescein angiography images of eyes and improve DR prediction.
4. Deep learning model for Image Caption Prediction:
We combine DCNN for fashion image classification with LSTM RNN for sequence modeling, to create a single network that generates clothing descriptions for images.
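
The fine-tuning recipe in sections 2-3 (keep a pre-trained feature extractor fixed and retrain the final layers on the target domain) can be illustrated in miniature; here a fixed random ReLU projection stands in for the pre-trained GoogLeNet backbone, and only a new logistic-regression head is trained:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy inputs standing in for images (rows) with 64 "pixels".
X = rng.normal(size=(300, 64))

# Frozen "pre-trained" backbone: a fixed random ReLU projection.
W_frozen = rng.normal(size=(64, 32))
feats = np.maximum(X @ W_frozen, 0)       # backbone outputs, never updated

# Hidden labelling rule for the new domain (so the toy task is learnable).
t = (feats @ rng.normal(size=32) > 0).astype(float)

# "Fine-tuning" here retrains only a new logistic-regression head.
w_head = np.zeros(32)
for _ in range(1000):
    p = 1 / (1 + np.exp(-(feats @ w_head)))
    w_head -= 0.1 * feats.T @ (p - t) / len(t)

acc = np.mean(((feats @ w_head) > 0) == (t == 1))
```

In practice one would also unfreeze and fine-tune some of the later convolutional layers at a small learning rate; the frozen-backbone case is just the simplest instance of the same idea.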

ACCEPTED TALK: This work was an accepted talk at –
• 3rd International Conference on Predictive Applications and APIs (PAPI) 2016 - http://www.papis.io/2016/program/talks/transfer-learning-and-fine-tuning-deep-convolution-neural-network-model-for-fashion-images-anusua-trivedi
• Hadoop+Strata World, Singapore, 2016 - http://conferences.oreilly.com/strata/hadoop-big-data-sg/public/schedule/detail/54278
Jörg Bornschein, Samira Shabanian, Asja Fischer and Yoshua Bengio. Bidirectional Helmholtz Machines
Abstract: Efficient unsupervised training and inference in deep generative models remains a challenging problem. One basic approach, called the Helmholtz machine [1] or Variational Autoencoder [2], involves training a top-down directed generative model p(x|h) together with a bottom-up auxiliary model q(h|x). While p(x|h) allows efficient sampling of the observables x given latent variables h, q(h|x) performs approximate inference over the latent variables h given an observed example x, i.e., it approximates p(h|x). Recent results indicate that better generative models can be obtained with better approximate inference procedures [3,4,5].

Instead of improving the inference procedure, we here propose a new model, the bidirectional Helmholtz machine, which guarantees that the top-down and bottom-up distributions can efficiently invert each other. We achieve this by interpreting both the top-down and the bottom-up directed models as approximate inference distributions and by defining the model distribution to be the geometric mean of the two. We present a lower bound for the likelihood of this model and show that optimizing this bound regularizes the model so that the Bhattacharyya distance between the bottom-up and top-down approximate distributions is minimized. This approach results in state-of-the-art generative models which prefer significantly deeper architectures, while allowing for orders-of-magnitude more efficient likelihood estimation.
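
In symbols, the model distribution is the normalized geometric mean of the two directed models (notation follows the abstract; see [6] for the full derivation):

```latex
p^{*}(\mathbf{x},\mathbf{h}) \,=\, \frac{1}{Z}\sqrt{p(\mathbf{x},\mathbf{h})\,q(\mathbf{x},\mathbf{h})},
\qquad
Z \,=\, \sum_{\mathbf{x},\mathbf{h}} \sqrt{p(\mathbf{x},\mathbf{h})\,q(\mathbf{x},\mathbf{h})} \,\le\, 1,
```

where Z \le 1 follows from the Cauchy-Schwarz inequality and -\log Z is the Bhattacharyya distance between p and q, so maximizing the resulting likelihood lower bound also pulls the two distributions toward each other.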

Note: This work was presented at ICML 2016 as a full research paper [6].

[1] G.E. Hinton, P. Dayan, B.J. Frey, and R.M. Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268:1158–1161, 1995.

[2] D.P. Kingma and M. Welling. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.

[3] T. Salimans, D.P. Kingma and M. Welling. Markov Chain Monte Carlo and Variational Inference: Bridging the Gap. arXiv:1410.6460, 2014.

[4] R.D. Hjelm, K. Cho, J. Chung, R. Salakhutdinov, V. Calhoun and N. Jojic. Iterative refinement of approximate posterior for training directed belief networks. arXiv preprint arXiv:1511.06382, 2015.

[5] Y. Burda, R.Grosse and R Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

[6] J. Bornschein, S. Shabanian, A. Fischer and Y Bengio. Bidirectional Helmholtz Machines. Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
Seong-Eun Moon and Jong-Seok Lee. Machine learning for brain imaging under perceptual experience of multimedia content
Abstract: As humans are the end-users of multimedia content, the ultimate goal of multimedia applications is to satisfy users by delivering appropriate content at the right moment in a proper way. However, the problem of how to specify ‘appropriate content’, ‘right moment’, and ‘proper way’ remains unsolved. A promising solution is to use brain imaging techniques such as electroencephalography (EEG), which make it possible to monitor and analyze brain states in order to understand how users perceive multimedia content. These techniques can be applied to content recommendation, adaptive content delivery, automatic feedback generation, content delivery for disabled people, etc. For this, it is crucial to develop machine learning systems that can recognize brain states of high-level perception from brain imaging data.
Perception of multimedia content can be defined in several ways: quality of experience (QoE), emotion, aesthetics, etc. In particular, QoE, which is more user-centric than quality of service, is regarded as a proper criterion of the quality that end-users perceive. Furthermore, demand for realistic and immersive multimedia content has been increasing recently. High dynamic range (HDR) imaging has been attracting much attention as a technology that can meet this demand. In this work, we aim at classifying the brain state of watching HDR videos via machine learning methods.

First, we construct a database consisting of conventional and HDR videos, EEG measured while watching the videos, and subjective ratings for contrast quality, interest, overall quality, and naturalness. We make the database publicly available so that it will be valuable for further technological development and performance benchmarking by the research community.

Second, we design machine learning systems that distinguish the perceptual response to HDR videos from that to conventional videos. Spectral powers of EEG channels for each frequency band (theta, alpha, beta, and gamma) are employed as features, and neural networks and support vector machines are used as classifiers. As a result, successful classification performance is obtained for both subject-dependent classification (82.1%) and subject-independent classification (56.6%). Furthermore, we examine various connectivity features that measure the level of co-activation of different brain regions. The directed transfer function, which considers directed information flows between EEG channels, significantly improves the performance (accuracies of 93.6% and 67.9% for subject-dependent and subject-independent classification, respectively). In particular, the subject-independent classification performance is significantly enhanced by employing the connectivity-based features, which indicates that brain connectivity is a relatively general representation of the QoE of HDR videos.

Different neuroimaging techniques (e.g., functional near-infrared spectroscopy) and different perceptual factors (e.g., emotion, aesthetics) will be investigated as extensions of our study. In addition, deep learning methods will be adopted to enhance the performance. In particular, as the potential of connectivity features to enhance brain state classification was validated in our research, we are working on developing novel brain imaging data representations and machine learning techniques based on graph signal processing theory, which can effectively account for connections between different brain regions and are more informative than conventional connectivity features.
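
For concreteness, the spectral band-power features described above can be computed per EEG channel from the power spectrum; a minimal single-channel sketch with synthetic data (the conventional band edges used here are an assumption):

```python
import numpy as np

fs = 256                                   # sampling rate in Hz
t = np.arange(0, 4, 1 / fs)                # 4 s of signal
rng = np.random.default_rng(3)
# Synthetic "EEG": a dominant 10 Hz (alpha) rhythm plus noise.
x = np.sin(2 * np.pi * 10 * t) + 0.2 * rng.normal(size=t.size)

freqs = np.fft.rfftfreq(x.size, 1 / fs)
psd = np.abs(np.fft.rfft(x)) ** 2          # raw power spectrum

bands = {"theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}
power = {name: psd[(freqs >= lo) & (freqs < hi)].sum()
         for name, (lo, hi) in bands.items()}
```

Repeating this per channel (and per trial window) yields the feature vectors fed to the neural network and SVM classifiers.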
Jane Hung, Deepali Ravel, Matthias Marti and Anne Carpenter. Identifying Infected Blood Cells using R-CNN-based Object Detection
Abstract: Biology contains a multitude of problems made for object detection. Although there has been a lot of interest in deep learning based models and their success in object detection, the state of the art models from competitions like ImageNet Large Scale Visual Recognition Challenge (ILSVRC) have not yet been widely applied to biological image data. We are interested in using object detection to identify cells and recognize their categories for diseases such as malaria, where manual inspection of microscopic views by trained experts remains the gold standard. A robust solution would allow for automated single cell classification and counting and would provide enormous benefits due to faster and more accurate quantitative results without human variability.

Object detection of cells in brightfield microscopy images presents special challenges. Like natural images, microscopy images of malaria-infected blood exhibit variations in illumination from the microscope; variations in cell shape, density, and color arising from sample preparation; and objects of uncertain class (even for experts). However, unlike natural images, there is a dearth of annotated data useful for training, because of the scarcity of experts, and the class distribution is inherently highly imbalanced due to the dominance of uninfected red blood cells (RBCs).

Previous attempts to automate the process of identifying and quantifying malaria have used complex workflows for image processing and machine learning classification using features from a predetermined set of measurements (intensity, shape, and texture). However, none of these methods has gained major traction, owing to a lack of generalizability and the difficulty of replication, comparison, and extension. Algorithms cannot be reimplemented with certainty or extended, because authors do not generally make functioning code available. Authors also rarely make their image sets available, which precludes replication of results. The lack of both a standard set of images and a standard set of metrics for reporting results has impeded the field.

For our task of detecting individual cells and their classes, we choose the Faster Region-based Convolutional Neural Network (Faster R-CNN), because R-CNN and its successors have been the basis of the top-performing object detection models in recent years. We use a model pre-trained on ImageNet but fine-tune it with our data. As a baseline, we develop a traditional approach consisting of cell segmentation and extraction of several single-cell-level features, followed by classification using random forests. To conduct our initial study, we collect and label (with the help of an expert malaria researcher) a dataset of 350 fields of view consisting of around 16,000 individual cells. We demonstrate that Faster R-CNN outperforms our baseline and put the results in the context of human performance.
Corina Dima. On the Compositionality and Semantic Interpretation of English Noun Compounds
Abstract: Representing words as vectors has been a very successful enterprise in natural language processing (NLP). Based on the intuition that words appearing in similar contexts have similar meanings, software packages that produce vector space models of language like word2vec and GloVe have had a big impact on NLP research. But natural language is more than words. In this work, we address the issue of learning to compose representations for multi-word units. We focus on creating representations for English noun compounds (e.g. 'software developer') starting from the representations of their constituent words (e.g. 'software' and 'developer').

We adapt two existing composition models, the Full Additive (p=Au+Bv) and the Matrix model (p=g(W[u;v]+b)), by implementing them using neural networks. The composition models are trained to minimize the mean squared error between the composed and the corpus-induced representations of the compounds in the training set (19,054 compounds). The corpus-induced compound representations are obtained using a text corpus: an underscore is added between the constituents of the compounds in our dataset so that the two words are treated as a single unit (e.g. 'software developer' becomes 'software_developer'). The GloVe package is then used to train word representations for both the single words and the compounds.
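
Since the Full Additive model is linear in the concatenation [u;v], its parameters can even be recovered in closed form on synthetic data; the sketch below (toy dimensions, random vectors, not the paper's setup) illustrates both composition functions:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5                                   # toy embedding dimension

# Ground-truth Full Additive composition p = A u + B v on random vectors
# standing in for word embeddings of modifiers and heads.
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
U = rng.normal(size=(1000, d))          # modifier vectors ('software')
V = rng.normal(size=(1000, d))          # head vectors ('developer')
P = U @ A.T + V @ B.T                   # "corpus-induced" compound vectors

# Since A u + B v = [A B] [u; v], the Full Additive model is linear in
# the concatenation, and least squares recovers A and B exactly here.
W, *_ = np.linalg.lstsq(np.hstack([U, V]), P, rcond=None)
A_hat, B_hat = W[:d].T, W[d:].T

# Matrix model: p = g(W [u; v] + b), here with a tanh nonlinearity.
def matrix_model(W2, b, u, v):
    return np.tanh(W2 @ np.concatenate([u, v]) + b)
```

With real corpus-induced vectors there is no exact linear relation, which is why the models are trained by minimizing mean squared error instead.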

The composition models are evaluated based on the similarity of the composed representations to the corpus-induced ones: good composed representations capture the meaning of the compound and are neighbours of the original representation, while the incorrect ones drift far away from the corpus-induced representation. The results show that the two composition models have similar results and manage to create meaningful representations for more than half of the test compounds. As expected, the composed representations do not make sense for non-compositional compounds like 'flea market'.

The composed representations should ideally capture all the relevant details with respect to the compound semantics. We use a separate task to evaluate the adequacy of the composed representations. This is a supervised classification task, where the goal is to predict the semantic relation that connects the constituents of a compound. Examples of such semantic relations are the 'part_of' relation in the compound 'finger nail' or the 'material' relation for 'iron fence'. We compare two types of neural network classifiers: (i) basic classifiers, which predict the semantic relation using only the constituent representations and (ii) composition-enhanced classifiers, which use either the composed vectors or the pre-trained composition matrices. In this setup, the composition-enhanced classifiers perform on par with the basic classifiers, despite being pre-trained in the composition process.

Both evaluations point to the need to improve composition functions, ideally by taking the semantic relations between words into account: composition should be modeled using both the constituents and their pairwise interaction patterns.

The full text of this work is available [1].

[1] Corina Dima: On the Compositionality and Semantic Interpretation of English Noun Compounds. Proceedings of 1st Workshop on Representation Learning for NLP @ ACL 2016, Berlin, Germany.
Wacha Bounliphone. A Test of Relative Similarity for Model Selection in Generative Models
Abstract: Generative models based on deep learning techniques aim to provide sophisticated and accurate models of data, without expensive manual annotation. This is especially of interest as deep networks tend to require comparatively large training samples to achieve good results. Model selection within this class of techniques can be a challenge, however. First, likelihoods can be difficult to compute for some families of recently proposed models based on deep learning. The current best method to evaluate such models is based on Parzen-window estimates of the log likelihood. Second, if we are given two models with similar likelihoods, we typically do not have a computationally inexpensive hypothesis test to determine whether one likelihood is significantly higher than the other. Permutation testing or other generic strategies are often computationally prohibitive, bearing in mind the relatively high computational requirements of deep networks.

To address this issue, we provide an alternative strategy for model selection, based on a novel, non-parametric hypothesis test of relative similarity. We treat the two trained networks being compared as generative models, and test whether the first candidate model generates samples significantly closer to a reference validation set. Our model selection criterion is based on the maximum mean discrepancy (MMD), which represents the distance between embeddings of empirical distributions in a reproducing kernel Hilbert space (RKHS). The maximum mean discrepancy is a metric on the space of probability distributions when a characteristic kernel is used, meaning that the distribution embeddings are unique for each probability measure. Rather than train a single model using the MMD distance to a reference distribution, our goal in this work is to evaluate the relative performance of two models, by testing whether one generates samples significantly closer to the reference distribution than the other. This extends the applicability of the MMD to problems of model selection and evaluation. Key to this result is a novel expression for the joint asymptotic distribution of two correlated MMDs (between samples generated from each model, and samples from the reference distribution).
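
The empirical MMD at the core of the test has a simple form; the following is a minimal biased estimator with an RBF kernel (a sketch of the statistic only, not the paper's relative test, which additionally requires the joint null distribution of two correlated MMDs):

```python
import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of MMD^2 between samples X and Y (RBF kernel)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(5)
ref = rng.normal(size=(200, 2))            # "validation" sample
close = rng.normal(size=(200, 2))          # model 1: same distribution
far = rng.normal(loc=1.5, size=(200, 2))   # model 2: shifted distribution
```

Here the "model" whose sample is MMD-closer to the reference would be preferred; the paper's contribution is a principled significance test for that comparison.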

In experiments on deep generative models, including the variational auto-encoder and generative moment matching network, the tests provide a meaningful ranking of model performance as a function of parameter and training settings. We first demonstrate that the test performs as expected in scenarios where the same model is trained with different training set sizes, and the relative ordering of model performance is known. We then fix the training set size and change various architectural parameters of these networks, showing which models are significantly preferred with our test. We validate the rankings returned by the test using a separate set of data for which we compute alternate metrics for assessing the models, such as classification accuracy and likelihood.

- Link to the paper http://arxiv.org/abs/1511.04581
- This work was presented at the International Conference on Learning Representations 2016.
- Best Scientific Contribution Award (2nd ex aequo) - from University Paris-Saclay STIC Doctoral School 2016.
Luana Bulat, Douwe Kiela and Stephen Clark. Vision and feature norms: inferring attributes of words from images using cross-modal maps
Abstract: Much work in cognitive science suggests that human concept representation relies on salient attributes. Property norm datasets are constructed by asking human participants to identify the most important attributes of a concept (e.g. bananas are YELLOW, SWEET, A_FRUIT and they GROW_IN_TREES) and are widely used to test cognitive models of conceptual representation. These conceptual properties are a valuable source of semantic information, and can potentially be applied to a variety of NLP tasks, from metaphor processing to image retrieval, but are expensive to obtain because they involve intensive human annotation. The alternative to collecting more data through human annotation is to increase the coverage of property norms datasets by automatically inferring properties of concepts from easily accessible resources (e.g. textual data). In this work we follow recent trends in multi-modal semantics and explore automatic property norm learning from visual, in addition to textual, data, showing that images serve as better predictors of a concept’s properties than its textual distribution.

We construct state-of-the-art linguistic and visual representations for each of the 541 concepts in the McRae dataset. In order to learn visual representations of concepts, for each word we retrieve 10 images from Google image search, and for each image we extract the pre-softmax layer from a forward pass in a CNN that has been trained on the ImageNet classification task. The visual representation of a concept is then obtained by taking the mean of the individual image representations.

The McRae dataset can be treated as a bag of properties (i.e. a property-based semantic space), where concepts are distributed representations over properties. We use partial least squares regression (PLSR) to learn cross-modal maps to the property-based space from the visual and linguistic semantic spaces, respectively. Cross-modal maps represent a formalisation of the reference problem; for example, by inducing cross-modal maps between visual and property-based vectors we can learn which images (represented as visual vectors) refer to which concepts (represented as property-based distributed vectors). This represents an extension of the object recognition problem, since we want to associate images with semantic representations of their depicted objects, rather than just with their label.
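
The cross-modal mapping can be sketched with a plain least-squares map standing in for PLSR (an intentional simplification on synthetic data): learn a linear map from "visual" vectors to property vectors on training concepts, then evaluate zero-shot retrieval on held-out concepts:

```python
import numpy as np

rng = np.random.default_rng(6)
n_concepts, d_vis, d_prop = 50, 20, 15

# Synthetic stand-ins: visual vectors linked to property-norm vectors
# by an unknown linear relation plus noise.
V = rng.normal(size=(n_concepts, d_vis))
M_true = rng.normal(size=(d_vis, d_prop))
P = V @ M_true + 0.05 * rng.normal(size=(n_concepts, d_prop))

# Fit the cross-modal map on 40 training concepts (least squares here,
# where the paper uses partial least squares regression).
train, test = slice(0, 40), slice(40, 50)
M_hat, *_ = np.linalg.lstsq(V[train], P[train], rcond=None)

# Zero-shot evaluation: map unseen visual vectors into property space
# and retrieve the nearest held-out property vector.
P_hat = V[test] @ M_hat
dists = ((P_hat[:, None, :] - P[test][None, :, :]) ** 2).sum(-1)
top1 = np.mean(dists.argmin(axis=1) == np.arange(10))
```

Training the map in the opposite direction (properties to visual vectors) gives the property-based image retrieval use case described below.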

We show that conceptual properties can be more accurately predicted through cross-modal mappings from raw perceptual information (i.e. image data) rather than from purely textual information, using a zero-shot learning task as evaluation. We also demonstrate a possible use case for property norm datasets in an image retrieval task (property-based query engine), by training the cross-modal mapping to learn which concepts refer to which images (effectively learning the inverse mapping). We show that we can reliably predict what concepts look like based on their semantic properties; for example, something that flies, has wings and a beak does indeed look like a bird.

This work was presented at NAACL HLT 2016. Full text of this work is available [1].

[1] Bulat, L., Kiela, D., & Clark, S. (2016). Vision and Feature Norms: Improving automatic feature norm learning through cross-modal maps. In Proceedings of NAACL-HLT (pp. 579-588).
Zelda Mariet and Suvrit Sra. Kronecker Determinantal Point Processes
Abstract: Determinantal Point Processes (DPPs) are probabilistic models over all subsets of a ground set of N items. They have recently gained prominence in several applications that rely on "diverse" subsets. However, their applicability to large problems is still limited due to the O(N^3) complexity of core tasks such as sampling and learning. We enable efficient sampling and learning for DPPs by introducing KronDPP, a DPP model whose kernel matrix decomposes as a tensor product of multiple smaller kernel matrices. This decomposition immediately enables fast exact sampling. But contrary to what one may expect, leveraging the Kronecker product structure for speeding up DPP learning turns out to be more difficult. We overcome this challenge, and derive batch and stochastic optimization algorithms for efficiently learning the parameters of a KronDPP.
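As a concrete illustration of why the Kronecker structure helps, determinants, the central quantity in DPP probabilities, factor over Kronecker products. A minimal pure-Python sketch with hypothetical kernel values (not the authors' code):

```python
# Toy illustration: the KronDPP kernel is L = L1 (x) L2, a Kronecker
# product of smaller PSD matrices.  Determinants then factor:
# det(L1 (x) L2) = det(L1)^n2 * det(L2)^n1.

def kron(A, B):
    """Kronecker product of two square matrices given as lists of lists."""
    na, nb = len(A), len(B)
    return [[A[i // nb][j // nb] * B[i % nb][j % nb]
             for j in range(na * nb)] for i in range(na * nb)]

def det(M):
    """Determinant by cofactor expansion (fine for tiny matrices)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += ((-1) ** j) * M[0][j] * det(minor)
    return total

L1 = [[2.0, 0.5], [0.5, 1.0]]   # small PSD kernels (hypothetical values)
L2 = [[3.0, 1.0], [1.0, 2.0]]
L = kron(L1, L2)                # 4x4 kernel over the product ground set

lhs = det(L)
rhs = det(L1) ** 2 * det(L2) ** 2   # n1 = n2 = 2
assert abs(lhs - rhs) < 1e-9
```

Working with the small factors instead of the full N x N kernel is what makes sampling and (with more care) learning tractable.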
Maja Rudolph and David Blei. Exponential Family Embeddings
Abstract: (This work was accepted at NIPS 2016)
Word embeddings are a powerful approach for capturing semantic similarity among terms in a vocabulary. In this paper, we develop exponential family embeddings, a class of methods that extends the idea of word embeddings to other types of high-dimensional data. As examples, we studied neural data with real-valued observations, count data from a market basket analysis, and ratings data from a movie recommendation system. The main idea is to model each observation conditioned on a set of other observations. This set is called the context, and the way the context is defined is a modeling choice that depends on the problem. In language the context is the surrounding words; in neuroscience the context is close-by neurons; in market basket data the context is other items in the shopping cart. Each type of embedding model defines the context, the exponential family of conditional distributions, and how the latent embedding vectors are shared across data. We infer the embeddings with a scalable algorithm based on stochastic gradient descent. On all three applications - neural activity of zebrafish, users' shopping behavior, and movie ratings - we found exponential family embedding models to be more effective than other types of dimension reduction. They better reconstruct held-out data and find interesting qualitative structure.
Liran Szlak, Ohad Shamir and Jonathan Rosenski. Multi-Player Bandits - a Musical Chairs Approach
Abstract: The stochastic multi-armed bandit (MAB) problem is a classic and well-studied setting of sequential decision-making, which exemplifies the dilemma of exploration vs. exploitation. In this problem, a player sequentially chooses from a set of actions, denoted as `arms'. At every round, each arm produces a reward sampled from some unknown distribution in [0,1], and the player receives that reward, but does not observe the reward of other arms. The player's goal, of course, is to maximize the cumulative reward. The dilemma of exploration vs. exploitation here is that the more the player `explores' by trying different arms, the better her understanding of each arm's expected reward; the more she `exploits' the arm she thinks is best, the fewer rounds are wasted on exploring bad arms. The exploration-exploitation trade-off arises in many natural settings, and optimization in the face of it has important applications in various fields.

We consider a variant of the stochastic multi-armed bandit problem, where multiple players simultaneously choose from the same set of arms and may collide, receiving no reward. This setting has been motivated by problems arising in cognitive radio networks, where several users utilize the same set of channels, in a situation where the quality of the different channels varies, and direct coordination between the players is not possible. This setting is especially challenging under the realistic assumption that communication between players is limited. We focus on a particularly challenging situation, where the players cannot communicate, there is no central control, and the players cannot even know how many other players are also participating.

Our goal is to find a distributed algorithm for players that will maximize the sum of their rewards. We use the standard notion of (expected) regret to measure our performance, namely the difference between the expected cumulative reward of the arm with highest mean reward, and the expected cumulative rewards of the players.
In this work we provide a communication-free algorithm (Musical Chairs) which attains constant regret with high probability, as well as a sublinear-regret, communication-free algorithm (Dynamic Musical Chairs) for the more difficult setting of players dynamically entering and leaving throughout the game. Moreover, neither algorithm requires prior knowledge of the number of players. To the best of our knowledge, these are the first communication-free algorithms with these types of formal guarantees. We also rigorously compare our algorithms to previous work, and complement our theoretical findings with experiments.
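To convey the flavour of the fixation idea behind Musical Chairs, here is a heavily simplified pure-Python sketch. The arm means, round budget, and player count are hypothetical, and the actual algorithm also includes a learning phase that estimates the arm means and the number of players:

```python
import random

def musical_chairs(n_players, arm_means, rounds, seed=0):
    """Simplified 'musical chairs' fixation phase (a sketch, not the
    paper's full algorithm): each unfixed player picks a uniformly
    random arm among the best n_players arms; a player that gets an
    arm alone 'sits down' and keeps it forever."""
    rng = random.Random(seed)
    top = sorted(range(len(arm_means)), key=lambda a: -arm_means[a])[:n_players]
    seat = [None] * n_players          # fixed arm of each player, if any
    for _ in range(rounds):
        picks = [seat[p] if seat[p] is not None else rng.choice(top)
                 for p in range(n_players)]
        for p in range(n_players):
            # sit down if no other player chose the same arm this round
            if seat[p] is None and picks.count(picks[p]) == 1:
                seat[p] = picks[p]
        if all(s is not None for s in seat):
            break
    return seat

seats = musical_chairs(3, [0.9, 0.8, 0.7, 0.2], rounds=1000)
# once all players are fixed, they occupy the three best arms collision-free
assert sorted(seats) == [0, 1, 2]
```

After fixation no further collisions occur, which is why regret stops accumulating and stays constant with high probability.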

Note: This work was presented at ICML 2016 as a full research paper.
Jiaqian Yu and Matthew Blaschko. Efficient Learning for Discriminative Segmentation with Supermodular Losses
Abstract: Several non-modular loss functions have been considered in the context of image segmentation. These loss functions do not necessarily have the same structure as the segmentation inference algorithm, and in general, we may have to resort to generic submodular minimization algorithms for loss augmented inference. Although these come with polynomial time guarantees, they are not practical to apply to image scale data.
In this work, we first propose a supermodular loss function that is itself optimizable with graph cuts. It counts the number of incorrect pixels plus the number of pairs of neighboring pixels that both have incorrect labels. Maximization of this loss function is equivalent to a supermodular function maximization problem. It emphasizes the importance of correctly predicting adjacent groups of pixels, e.g. those present in thin structures more than one pixel wide. The loss function maps to a set function by considering the set of pixels that are incorrectly labeled, while the inference maps to a set function by considering the set of pixels that are labeled as foreground. The incorporation in a joint loss-augmented inference leads to non-submodular potentials.
We therefore use a decomposition strategy based on the alternating direction method of multipliers (ADMM). It consists of alternately optimizing the loss function and performing Maximum a Posteriori (MAP) inference, with each process augmented by a quadratic term enforcing the labelings determined by each to converge to the optimum of the sum. In general, we simply need task-specific solvers for the two subproblems, which need not use a single graph cut algorithm and can therefore exploit any available structure. In this way, we gain computational efficiency, making new choices of loss functions practical, while simultaneously making the inference algorithm employed during training closer to the test time procedure.
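The mechanics of the ADMM splitting can be illustrated on a toy scalar problem. This is only a sketch of the decomposition principle, not the segmentation code; the two quadratics stand in for the loss and MAP subproblems:

```python
# Minimal scalar ADMM sketch: minimize f(x) + g(z) subject to x = z,
# alternating two easy subproblems coupled by a quadratic penalty.

def admm(rho=1.0, iters=100):
    # f(x) = (x - 3)^2, g(z) = (z + 1)^2; the coupled optimum is x = z = 1
    x = z = u = 0.0
    for _ in range(iters):
        x = (2 * 3 + rho * (z - u)) / (2 + rho)    # x-update: min f + (rho/2)(x - z + u)^2
        z = (2 * -1 + rho * (x + u)) / (2 + rho)   # z-update: min g + (rho/2)(x - z + u)^2
        u += x - z                                 # dual update enforcing x = z
    return x, z

x, z = admm()
assert abs(x - 1.0) < 1e-6 and abs(z - 1.0) < 1e-6
```

In the segmentation setting, one subproblem is the supermodular loss maximization and the other is MAP inference, each solved by whatever specialized solver fits its structure.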
We show improvement both in accuracy and computational performance on the Microsoft Research GrabCut database and a brain structure segmentation task. We show that: (i) our proposed splitting strategy is orders of magnitude faster than the minimum norm point algorithm; (ii) our strategy yields results nearly identical to an LP-relaxation while being much faster in practice; and (iii) training with the same supermodular loss as during test time yields better performance. Qualitative segmentation results show that our 8-connected loss achieves better performance on the foreground/background boundary, as well as on elongated structures of the foreground object, such as the head and legs, especially when the appearance of the foreground is similar to the background.
We empirically validate the use of a supermodular loss during training and the improved computational properties of the proposed ADMM approach over the Fujishige-Wolfe minimum norm point algorithm. We envision that this can be of use in a wide range of application settings, and an open source general purpose toolbox for this efficient segmentation framework with supermodular losses is available for download from https://github.com/yjq8812/efficientSegmentation.

Note: This work is accepted at BMVC2016 for oral presentation.
Ekaterina Kochmar. Assessment of Lexical Knowledge Acquisition by Non-Native Speakers of English
Abstract: In this work, we use machine learning algorithms to assess the process of lexical knowledge acquisition. In particular, we study how open class words – nouns (e.g., 'solution'), verbs (e.g., 'make') and adjectives (e.g., 'strong') – are selected and combined in a meaningful way that preserves communicative intent of the speaker. The clear-cut cases illustrating the challenges involved in this process are first language acquisition by children and second language acquisition by non-native speakers. In the current work, we focus on the latter.

We address this task from three different perspectives and implement systems aimed at detecting anomalies in lexical choice accordingly:

(1) First, we study how lexical knowledge is acquired in general. For example, what makes combinations like 'strong tea' and 'find solution' semantically appropriate and *'powerful tea' and *'find decision' anomalous? To this end, we implement System1 that models word choice using distributed vector representations learned with word2vec for individual words within combinations (e.g., 'find', 'tea', 'powerful'), and applies compositional functions to derive phrase-level vector representations. Next, we apply a supervised machine learning classifier that uses features based on the properties of the semantic phrase vectors and detects anomalies in the word choice. In our experiments, random forests show superior performance compared to a number of other classifiers and achieve state-of-the-art results.

(2) Next, we look into other factors that influence the word choice. In particular, we process texts written by non-native speakers of English and investigate to what extent the lexico-semantic models of their native languages (L1) vary and interfere with the word choice in the second language (L2). For example, the concept of pulling the trigger (L2=English) is expressed as *'pushing the trigger' in some other languages (including L1=Russian and L1=Spanish). This exemplifies the differences in the lexico-semantic models across languages and causes errors in lexical choice in L2. System2 is based on the hypothesis that these differences express themselves in the strength of lexical association in L1 and L2: for example, *'find decision' and 'find solution' both have high association scores in Russian, but only the latter also has a high association score in English.
We demonstrate that System2 that incorporates the information about lexico-semantic models improves the performance on this task.

(3) Finally, we explore to what extent the lexical choice is influenced by contextual fit. For example, *'find decision' appears to be anomalous when used in isolation but not when it is used in a context like 'She found his decision unreasonable'. System3 models the context using a recurrent neural network language model (RNNLM) and derives features, based on the scores output by the RNNLM, for the classifier that decides whether the phrase fits the context surrounding it. We find that using the contextual information through the RNNLM improves the results of System1 and System2.
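The composition step underlying System1 can be sketched with hypothetical 3-dimensional vectors; real word2vec embeddings have hundreds of dimensions, and the actual classifier uses richer phrase-vector features:

```python
import math

# Hypothetical toy embeddings (not real word2vec vectors), used only to
# illustrate additive composition and one plausible classifier feature.
vec = {
    "strong":   [0.9, 0.1, 0.0],
    "powerful": [0.8, 0.0, 0.4],
    "tea":      [0.1, 0.9, 0.1],
}

def add_compose(u, v):
    """Additive composition: one simple phrase-level representation."""
    return [a + b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

phrase_ok = add_compose(vec["strong"], vec["tea"])
phrase_odd = add_compose(vec["powerful"], vec["tea"])

# One plausible feature: similarity of the phrase vector to its head noun.
# With these toy vectors, the anomalous phrase scores lower.
f_ok = cosine(phrase_ok, vec["tea"])
f_odd = cosine(phrase_odd, vec["tea"])
assert f_odd < f_ok
```

A supervised classifier (random forests in the paper's experiments) would consume such phrase-vector features and decide whether the combination is anomalous.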
Ariel Herbert-Voss, Gregory A Ciccarelli, Thomas F Quatieri and Christopher J Smalt. A biologically-informed deep learning approach to speech signal processing
Abstract: Speech is an acoustic signal transduced into an electric signal by hair cells in the cochlea. The fidelity of transduction depends critically on these hair cells, and their damage results in hearing loss. Our goal is to develop a neuro-biologically plausible model of auditory processing that can simulate this damage, to study its effects on the representation of speech in the auditory system. In particular, speech intelligibility is thought to rely heavily on formants [1], spectral peaks produced by time-varying resonances of the vocal tract, which determine vowel quality. To better understand the effects of hair cell damage on formant representation, we leverage deep learning algorithms for modeling hierarchical sensory processing in the brain to create a biologically-informed model of hearing loss.
Recent developments in computational neuroscience have made strides using hierarchical deep neural networks to illuminate the development and organization of sensory cortical processing [2]. There also exists prior work in estimating formants using deep learning [3]. We extend these approaches to develop a framework to model sensorineural hearing loss. Our primary contribution is a model of auditory processing informed by neurobiology that can simulate both normal and hearing-impaired listeners.
Analogous to the physiological filter in the auditory system, our model uses a gammatone filterbank to represent cochlear output across the auditory nerve. This allows us to simulate hearing damage by modifying the gain and tuning of these filters to reflect cochlear hair cell loss. We then feed this representation into a fan-out fully-connected deep neural network to represent hierarchical processing in the auditory system. The network is trained as a regression model on a pre-labeled corpus of read speech [4]. We benchmarked with RMS error under simulated normal hearing conditions and found levels of performance comparable to state-of-the-art formant tracking algorithms, indicating the reliability of our model in simulating human auditory processing of speech.
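A sketch of the standard fourth-order gammatone impulse response such a filterbank is built from. Parameter values here are illustrative, not the authors' exact configuration; hearing damage would be simulated by altering the gain and the bandwidth b:

```python
import math

# Standard gammatone impulse response g(t) = t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*f*t),
# with bandwidth b set from the ERB scale ERB(f) = 24.7*(4.37*f/1000 + 1).

def erb(f_hz):
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def gammatone_ir(f_hz, fs=16000, n=4, dur=0.025):
    b = 1.019 * erb(f_hz)      # common bandwidth scaling; widening b would
                               # mimic broadened tuning from hair-cell damage
    ts = [k / fs for k in range(int(dur * fs))]
    return [t ** (n - 1) * math.exp(-2 * math.pi * b * t)
            * math.cos(2 * math.pi * f_hz * t) for t in ts]

# A full bank would place centre frequencies across the audible range;
# here we build a single channel at 1 kHz.
ir = gammatone_ir(1000.0)
peak = max(abs(v) for v in ir)
assert len(ir) == 400 and peak > 0
```

The filterbank output (one channel per centre frequency) is the cochlea-like representation fed into the fully-connected network.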
[1] Carney LH, Li T, McDonough JM. “Speech coding in the brain: Representation of vowel formants by midbrain neurons tuned to sound fluctuations.” eNeuro. 2015 2(5) Epub 2015 Jul 20.
[2] Yamins DLK, DiCarlo JJ. “Using goal-driven deep learning models to understand sensory cortex.” Nature Neuroscience 19 256-365 (2016)
[3] Dissen Y, Keshet J. “Formant estimation and tracking using deep learning.” The 17th Annual Conference of the International Speech Communication Association (Interspeech), San Francisco, CA, 2016.
[4] Deng L, Cui X, Pruvenok R, Huang J, Momen S, Chen Y, Alwan A. “A database of vocal tract resonance trajectories for research in speech processing.” Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, 2006.
Remi Munos, Tom Stepleton, Anna Harutyunyan and Marc Bellemare. Safe and Efficient Off-Policy Reinforcement Learning
Abstract: One fundamental trade-off in reinforcement learning lies in the definition of the update target: should one estimate Monte Carlo returns or bootstrap from an existing Q-function? Return-based methods (where return refers to the sum of discounted rewards) offer some advantages over value bootstrap methods: they are better behaved when combined with function approximation, and quickly propagate the fruits of exploration. On the other hand, value bootstrap methods are much more readily applied to off-policy data: an essential use case, in which the behavior and target policies differ. In this work we show that learning from returns need not be at cross-purposes with off-policy learning.

We start from the recent work of Harutyunyan et al. [1] who show that naive off-policy policy evaluation that uses approximate values to correct for the ``off-policyness'' of a trajectory still converges to the desired value function, provided the behavior and target policies are not too far apart (where the maximum allowed distance depends on the ``return-length'' parameter, $\lambda$). Unfortunately, the assumption that the policies are close is restrictive, as well as difficult to uphold in the control case, where the target policy is always greedy with respect to the current Q-function. In that sense this algorithm is not safe: it does not handle the case of arbitrary ``off-policyness''. Alternatively, the Tree-Backup (TB) algorithm [2] tolerates arbitrary target/behavior discrepancies by scaling information from future temporal differences by the product of target policy probabilities. Tree-Backup is not efficient in the ``near on-policy'' case (similar behavior and target), though, as traces may be cut prematurely, preventing learning from full returns.

We formulate and analyze a common form for these and other off-policy, return-based algorithms. From this common form, we derive the improved algorithm, Retrace($\lambda$), which is both safe and efficient, enjoying convergence guarantees for off-policy policy evaluation and -- more importantly -- for the control setting. To our knowledge, this is the first online return-based off-policy control algorithm which does not require the GLIE (Greedy in the Limit with Infinite Exploration) assumption. As a corollary, we prove convergence of the popular Watkins’s Q($\lambda$) [3], which was still an open problem. Finally, we illustrate the benefits of Retrace($\lambda$) on a standard suite of Atari 2600 games in the context of replaying multi-step sequences of experience.
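The key ingredient of Retrace($\lambda$) is its truncated per-step trace coefficient $c_s = \lambda \min(1, \pi(a_s|x_s)/\mu(a_s|x_s))$. A small sketch on a hypothetical trajectory shows how truncation keeps the traces bounded where plain importance sampling would not:

```python
# Sketch of the Retrace(lambda) trace coefficients on a hypothetical
# trajectory: truncation keeps the traces bounded (safe), while
# near-on-policy ratios pass through almost untouched (efficient).

def retrace_traces(pi_probs, mu_probs, lam=0.9):
    traces, c = [], 1.0
    for p, m in zip(pi_probs, mu_probs):
        c *= lam * min(1.0, p / m)    # truncated importance weight
        traces.append(c)
    return traces

pi_probs = [0.9, 0.8, 0.9]   # target-policy probabilities of taken actions
mu_probs = [0.3, 0.5, 0.9]   # behavior-policy probabilities

rt = retrace_traces(pi_probs, mu_probs)
# Plain importance sampling would use products of pi/mu (here 3.0, 4.8, 4.8),
# which can grow without bound; Retrace caps each ratio at 1:
assert all(t <= 0.9 ** (i + 1) + 1e-12 for i, t in enumerate(rt))
```

These coefficients weight the temporal-difference errors in the off-policy return-based update; this sketch only shows the weights, not the full operator.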

Note: This work will be presented at the main track of NIPS 2016. Preprint available at: https://arxiv.org/abs/1606.02647.

[1] Harutyunyan A., Bellemare M.G., Stepleton T., and Munos R. (2016). Q($\lambda$) with off-policy corrections. In Proceedings of the 27th International Conference on Algorithmic Learning Theory (ALT) (To Appear).
[2] Precup, D., Sutton, R. S., and Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning.
[3] Sutton, R. and Barto, A. (1998). Reinforcement learning: An introduction, volume 116. Cambridge University Press.
Tatiana Shpakova and Francis Bach. Parameter Learning for Log-supermodular Distributions
Abstract: We consider log-supermodular models on binary variables, which are probabilistic models with negative log-densities which are submodular. These models provide probabilistic interpretations of common combinatorial optimization tasks such as image segmentation. In this paper, we focus primarily on parameter estimation in the models from known upper-bounds on the intractable log-partition function. We show that the bound based on separable optimization on the base polytope of the submodular function is always inferior to a bound based on ''perturb-and-MAP'' ideas. Then, to learn parameters, given that our approximation of the log-partition function is an expectation (over our own randomization), we use a stochastic subgradient technique to maximize a lower-bound on the log-likelihood. This can also be extended to conditional maximum likelihood. We illustrate our new results in a set of experiments in binary image denoising, where we highlight the flexibility of a probabilistic model to learn with missing data.
Swabha Swayamdipta, Miguel Ballesteros, Chris Dyer and Noah A. Smith. Multi-Task Learning for Incremental Parsing using Stack LSTMs
Abstract: Syntactic and semantic parsing of natural language text are indispensable for language understanding applications such as question answering and information extraction. Typical outputs for such parsers are labeled, directed trees or graphs, with words as vertices and edges encoding syntactic and semantic relations, also called dependencies. Considerable research has sought accurate and fast parsers that work well across text genres and languages.

Incremental parsers process text one word at a time, from left to right, at each step predicting syntactic or semantic relations (edges). They are attractive because their runtime is linear in the length of the input text. These algorithms make use of stack data structures to track the algorithm state, with each iteration cast as a classification decision among predefined operations that manipulate the stacks. The same algorithm state can be used to predict both syntactic and semantic decisions jointly (Henderson et al., 2008), with parameters estimated via multi-task learning.

Concretely, local incremental parsing decisions rely on features of the stack data structures that store partially-built structures, as-yet-unprocessed words, and the history of past decisions. Traditionally, these features are discrete, and manually designed (Nivre, 2008; Henderson et al., 2008), and linear classifiers or feedforward neural networks are applied. Manual feature design relies on linguistic intuition and careful ablation studies, and such systems tend to rely on only a small subset of the algorithm state captured in the top few items on the stack data structures. Further, in a multi-task setup, where predictions about syntax and semantics are allowed to interact, intuition serves less well.

We therefore propose a deep architecture using stack long short-term memories (Dyer et al., 2015), a variant of recurrent neural nets which are designed to handle discrete stack data structure operations. Stack LSTMs discover a representation of the entire algorithm state and the entire decision history by decomposing each into smaller pieces that are combined by recursive function evaluations - just as a stack is built by the "push" operation applied to an element and a smaller stack. Information further down the stack is not lost, but rather made available to help guide the model in its next decision. Our architecture is thus able to automatically extract features helpful for multi-task learning, obviating the need for manual feature design. Finally, we adopt a greedy prediction policy, which is both fast and simple to implement.

We obtain the best published parsing performance among models that jointly learn syntax and semantics on the CoNLL 2008–9 (Surdeanu et al., 2008; Hajič et al., 2009) English benchmarks. For the multilingual shared task at CoNLL 2009, our average performance is on par with the best reported results.
Fariba Yousefi, Zhenwen Dai, Carl Henrik Ek and Neil Lawrence. Unsupervised Learning with Imbalanced Data via Structure Consolidation Latent Variable Model
Abstract: In many medical applications, e.g. pathology, negatively labelled data is extremely easy to obtain (e.g. healthy cells). Positive labels, on the other hand, can be harder to acquire (e.g. particular disease morphologies). These massively imbalanced problems are challenging for most algorithms because the negative class tends to dominate the objective function and the resulting model performs poorly. In practice it is often better to throw away much of the negative data and rebalance the data set. Unsupervised learning has been attracting a lot of attention as it has the potential to serve as an underpinning technology for a range of challenges such as generative modeling, missing data imputation and coping with multiple data modalities. Unsupervised learning can also be applied to a wider range of data sets, because it does not rely on having carefully labelled data available. In this work we explore the possibility of using a variant of an unsupervised learning algorithm to solve the problem of label imbalance. We build latent variable models that can simultaneously accommodate a very large number of negative examples, sharing their characteristics appropriately with the positive class, while allowing the model to characterise the manner in which the positive class differs through preserved (or private) latent spaces that are separately learned for each class. The resulting model does not suffer from the standard challenges in this domain. We compare with a variant of the discriminative GP-LVM (the model that underpinned GaussianFace) and show significantly improved performance. Our probabilistic latent variable model divides its latent space into a shared space for all the categories and a private space for each category. The shared space captures the common regularities among categories (e.g. positive and negative class) and the private space models the variance specific to individual categories.
Because the modelling of the private space is category specific, there is no domination of its characteristics by the larger category. Thus the data in each category can be modelled appropriately while the common regularities are still exploited.

We implement the idea of shared and private spaces in the framework of Gaussian Process Latent Variable Models by deriving a particular covariance function (kernel) that enables such separation. We exploit closed-form variational lower bounds on the log marginal likelihood of the proposed model, which provide an efficient approximate inference method. The performance of our model is evaluated on a real image dataset in which the positive and negative data are extremely imbalanced. We show that our model can still learn from imbalanced data and perform well in both generative and discriminative tasks. Beyond the given example, the ability to jointly model multiple data categories and handle imbalanced datasets can be linked to many other areas such as transfer learning.

Full text of this work is available at https://arxiv.org/abs/1607.00067.
Aijan Ibraimova, Michael Figurnov, Dmitry Vetrov and Pushmeet Kohli. Combination of methods for acceleration of Convolutional Neural Networks
Abstract: Convolutional neural networks (CNNs) are a central topic in machine learning today. They are widely used in solving vision problems, and have already achieved impressive results not only in classification and segmentation but also in other real-world problems. For instance, CNNs are employed to capture lower-level (character-level) representations in recent work on Natural Language Processing.

In many cases, the computational complexity of the neural network and, consequently, its training and test times increase together with its performance. While CNNs demonstrate significant advantages, the approach is hard to apply in the case of restricted computational resources. Therefore, acceleration of neural networks will open a broad perspective in computer vision and other fields. It will also allow us to use such models in real-time applications or devices with limited power consumption, e.g. mobile phones. The aim of this work is to accelerate the test-time computation of convolutional neural networks.

Convolutional neural networks' parameters are often redundant: the number of calculations inside the network can be reduced by wisely selecting the influential parameters and removing the other ones. This work describes in detail two previously proposed ways of accelerating convolutional neural networks: the first [1] is based on the assumption of low-rank behavior of the responses of a convolutional layer, while the second [2] spatially samples the responses of a convolutional layer. Under the first assumption, a convolutional layer can be decomposed into two layers with lower total complexity. The second method computes only the most important outputs of a convolutional layer; all other values are interpolated from the value of the nearest neighbor. These two methods exploit different types of redundancy and therefore offer great potential for combination.
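The source of the speedup in the low-rank method can be seen with back-of-envelope arithmetic (illustrative layer sizes, not taken from the paper's experiments):

```python
# A conv layer with c input channels and d filters of spatial size k x k
# costs k*k*c*d multiply-adds per output position.  Factoring it through a
# rank-r bottleneck (r spatial filters followed by a 1x1 projection to d
# channels) costs k*k*c*r + r*d instead.

def conv_cost(c, d, k):
    return k * k * c * d

def lowrank_cost(c, d, k, r):
    return k * k * c * r + r * d

c, d, k, r = 256, 512, 3, 64           # hypothetical layer dimensions
full = conv_cost(c, d, k)              # 1,179,648 mult-adds per position
fact = lowrank_cost(c, d, k, r)        # 180,224 mult-adds per position
speedup = full / fact
assert speedup > 6
```

The spatial-sampling method [2] saves computation orthogonally, by evaluating only a subset of output positions, which is why the two combine well.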

We propose to combine these methods. The composition takes into account nonlinear units, spatial redundancy of input images, individual reconstruction, and redundancy of every layer; it also considers which elements of the layer influence other layers along with other CNN nuances. It focuses on optimizing all convolutional layers. We also propose a novel approach of decomposing a layer using batches which can be used in machines with RAM restrictions.

The final approach for accelerating convolutional neural networks can simultaneously reduce inference time by a factor of 2-4 and the memory consumption of the net, with a negligible drop in accuracy. Experiments conducted on widely used CNN architectures corroborate the advantages of the proposed method over other approaches. Results of the discussed methods and their combination are presented and analyzed in this work.

[1] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. CoRR, abs/1505.06798, 2015
[2] Michael Figurnov, Aijan Ibraimova, Dmitry P. Vetrov, and Pushmeet Kohli. Perforated CNNs: Acceleration through elimination of redundant convolutions. Advances in Neural Information Processing Systems, 2016
Anna Bethke, Alexander Gude, Tiffany Jaya and Abhinav Ganesh. Hermes: A Modular Recommender System Framework
Abstract: Data scientists and software engineers are spending an increasing amount of time trying to locate the relevant datasets, software packages, and tools necessary for their jobs. And while new tools and computer programs are also being developed to ingest and understand this data, their existence is often not known outside of their originating group. Moreover, developers and analysts can be isolated from one another, leading to an analytical gap. To help organizations close their own gaps, we built a recommender system framework in PySpark named Hermes. Hermes is a recommender system comparison tool where developers can easily test a variety of algorithms on a wide range of performance metrics.

Commercial companies have started to address analytical gaps by providing organizations with tools to lead developers to the datasets, files, or coding notebooks they may find most useful. However, there are three areas in which these systems could be improved. First of all, to our knowledge, these systems are often based either on an item's popularity, on user-supplied tags and inputs, or solely on collaborative filtering techniques, ignoring the content of the items. Secondly, these systems are not generally extensible, as organizations differ from each other in interaction patterns (e.g., level of user-object density), types of objects (e.g., analytical tools, data storage systems, or dataset types) and use cases (e.g., requiring greater accuracy or diversity). In addition, recommender systems are often designed to optimize accuracy alone without considering the recommended items' diversity or novelty, producing recommendations that may not be useful to an end user.

To address these shortcomings, Hermes supports a variety of recommender system algorithms, datasets and performance metrics. To test the framework, we evaluated nine recommender system algorithms on ten different datasets using fifteen performance metrics. The initial datasets cover both traditional (i.e. movies, music, and book reviews) and non-traditional applications (i.e. code repository and geospatial data interactions) to better reflect the range of potential input datasets. Because we have standardized the format of the datasets, multiple collaborative filtering and content-based algorithms can easily be assessed. To our knowledge, Hermes contains the first open source content-based algorithms. The algorithms return predicted ratings in a standardized form which feeds into multiple performance metrics including RMSE, MAE, and Precision@N, along with multiple measures of coverage, diversity, novelty, and serendipity.
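Three of the accuracy metrics named above are easy to state concretely. This is a minimal sketch of the quantities being computed; Hermes' own implementations run on PySpark:

```python
import math

def rmse(pred, true):
    """Root mean squared error over paired predictions and ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def mae(pred, true):
    """Mean absolute error over paired predictions and ratings."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def precision_at_n(ranked_items, relevant, n):
    """Fraction of the top-n recommended items that are relevant."""
    top = ranked_items[:n]
    return sum(1 for item in top if item in relevant) / n

pred, true = [3.5, 4.0, 2.0], [4.0, 4.0, 1.0]
assert abs(rmse(pred, true) - math.sqrt((0.25 + 0.0 + 1.0) / 3)) < 1e-12
assert abs(mae(pred, true) - 0.5) < 1e-12
assert precision_at_n(["a", "b", "c", "d"], {"a", "c"}, 2) == 0.5
```

Coverage, diversity, novelty, and serendipity are defined analogously over the recommended item sets rather than over rating errors.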

Our contribution to this workshop will be to introduce the component parts of the Hermes framework and, in particular, to describe how it can be tailored to test new input datasets or recommender system algorithms. Further, we have shown that the content of code can be determined using standard natural language processing techniques such as word2vec. We found that predicting a user's interactions with a content-based algorithm was generally better than using a collaborative filtering technique. In the coming months, we will be investigating whether these findings transfer to additional code repositories, as well as implementing additional recommender system algorithms. The source code for this project is located at https://github.com/Lab41/hermes.
Ji Hyun Bak, Jung Yoon Choi, Athena Akrami, Ilana Witten and Jonathan Pillow. Adaptive optimal training of animal behavior
Abstract: Neuroscience experiments often require training animals to perform tasks designed to elicit various sensory, cognitive, and motor behaviors. Training typically involves a series of gradual adjustments of stimulus conditions and rewards in order to bring about learning. However, training protocols are usually hand-designed, relying on a combination of intuition, guesswork, and trial-and-error, and often require weeks or months to achieve a desired level of task performance. Here we combine ideas from reinforcement learning and optimal experimental design to formulate methods for adaptive optimal training of animal behavior. Our work addresses two intriguing problems at once: first, it seeks to infer the learning rules underlying an animal's behavioral changes during training; second, it seeks to exploit these rules to select stimuli that will maximize the rate of learning toward a desired objective. We develop and test these methods using data collected from rats during training on an auditory discrimination task. We show that we can accurately infer the parameters of a policy-gradient-based learning algorithm that describes how the animal's internal model of the task evolves over the course of training. We then formulate a theory for optimal training, which involves selecting sequences of stimuli that will drive the animal's internal policy toward a desired location in the parameter space. Simulations show that our adaptive training method can achieve a substantial speedup over standard training methods. These results will hold broad theoretical interest for researchers in reinforcement learning, and offer immense practical benefits to neuroscientists tasked with training animals.
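The two ingredients of such an approach, a policy-gradient learning rule for the simulated animal and stimulus selection that steers that rule toward a target policy, can be caricatured in a toy one-parameter simulation. This is our illustration of the general idea only, not the authors' model or task:

```python
import math, random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reinforce_step(w, x, lr=0.5):
    # Simulated learner: a logistic policy over two actions, updated by
    # REINFORCE. The "correct" action is 1 for positive stimuli, 0 otherwise.
    p = sigmoid(w * x)
    action = 1 if random.random() < p else 0
    reward = 1.0 if action == (1 if x > 0 else 0) else 0.0
    grad = (action - p) * x          # d log pi(action) / dw
    return w + lr * reward * grad

def pick_stimulus(w, w_target, candidates, n_sim=200):
    # "Optimal" trainer: simulate updates for each candidate stimulus and
    # choose the one expected to move w closest to the target parameter.
    def expected_gap(x):
        return sum(abs(reinforce_step(w, x) - w_target)
                   for _ in range(n_sim)) / n_sim
    return min(candidates, key=expected_gap)

w, w_target = 0.0, 3.0
for _ in range(300):
    # The null stimulus 0.0 causes no update, letting the trainer
    # stop driving the policy once it is near the target.
    x = pick_stimulus(w, w_target, [0.0, -1.0, -0.25, 0.25, 1.0])
    w = reinforce_step(w, x)
print(round(w, 3))
```

In the simulation the trainer first picks large, informative stimuli and then progressively gentler ones as the learner's parameter approaches the target, mirroring the adaptive-training intuition.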
Christina Lee, Yihua Li, Devavrat Shah and Dogyoon Song. Blind Regression: Nonparametric Regression for Latent Variable Models via Collaborative Filtering
Abstract: We introduce the framework of Blind Regression motivated by the problem of Matrix Completion for recommendation systems: Given n users and m movies, the goal is to predict the unknown rating of a user for a movie using known observations, i.e., to complete the partially observed matrix. Following the framework of non-parametric statistics, we posit that user u and movie i have features x_1(u) and x_2(i) respectively, and their corresponding rating y_{ui} is given by f(x_1(u), x_2(i)) for some unknown function f. In contrast with classical regression, the features x = (x_1(u), x_2(i)) are not observed, and thus we are not able to compute the distances between the input points, making it challenging to apply standard regression methods to predict the rating for an unknown user-movie pair.

The widely studied low rank matrix assumption imposes that x_1(u) and x_2(i) are points within the d-dimensional Euclidean space, and the function f is the inner product of x_1(u) and x_2(i). However, this is quite restrictive, as the low rank assumption would be violated simply if the observed matrix was obtained via a nonlinear measurement function applied entry-wise to the corresponding inner products. Inspired by the classical Taylor's expansion for differentiable functions, we provide a prediction algorithm that is consistent for all Lipschitz continuous functions. In fact, the analysis through our framework naturally leads to a variant of collaborative filtering, and hence, our result sheds insight into explaining the widespread success of collaborative filtering in practice. Assuming each entry is sampled independently with probability at least O(\max(m^{-1/2 + \delta}, n^{-1 + \delta})) with \delta > 0, we provide bounds on the mean squared error that go to zero with n.
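The first-order Taylor idea can be made concrete in a small sketch: if rows u and v both have observations in some column j, then y_{ui} can be estimated as y_{uj} + y_{vi} - y_{vj}. This toy version averages such estimates uniformly; the paper's estimator additionally weights them (e.g., by empirical row variance):

```python
def predict(y, u, i):
    # y: dict of dicts, y[row][col] = observed rating.
    # Estimate the missing entry y[u][i] from overlapping rows:
    # y[u][i] ≈ y[u][j] + y[v][i] - y[v][j] for each usable (v, j) pair.
    estimates = []
    for v in y:
        if v == u or i not in y[v]:
            continue
        for j in y[u]:
            if j != i and j in y[v]:
                estimates.append(y[u][j] + y[v][i] - y[v][j])
    return sum(estimates) / len(estimates) if estimates else None

# Ratings generated by the Lipschitz function f(a, b) = a + b,
# with the entry (u, i) left unobserved.
y = {"u": {"j1": 3.0, "j2": 4.0},
     "v": {"j1": 2.0, "j2": 3.0, "i": 5.0}}
print(predict(y, "u", "i"))  # → 6.0
```

For this additively generated matrix the first-order estimate is exact; for general Lipschitz f it is accurate to the first order in the latent-feature differences.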

Experiments with the MovieLens and Netflix datasets suggest that our algorithm provides principled improvements over basic collaborative filtering and matrix factorization methods. Since the framework allows for generic Lipschitz functions and generic latent metric spaces, the algorithm and analysis naturally extend to the setting of higher order tensors by simply flattening the tensor into a matrix, combining two coordinates into one. We apply our method to the tensor completion problem for image reconstruction, showing that our simple and principled approach is competitive with existing tensor completion heuristics.

This work will be presented at NIPS 2016.
Nidhi Gupta, Ayan Seal, Pushpraj Bhatele and Pritee Khanna. Selective Block Based Approach for Neoplasm Detection from T2-Weighted Brain MRIs
Abstract: A realistic challenge in neuroanatomy is to assist radiologists in detecting brain neoplasms at an early stage. This paper presents a fast and accurate Computer Aided Diagnosis (CAD) system based on a selective block-based approach for neoplasm (tumor) detection from T2-weighted brain MR images. The salient contribution of the presented work lies in fast discrimination using the selective block-based approach. Local binary patterns are used as features, which are classified by a support vector machine. The experiments are performed on a dataset of 100 patients, of whom 55 were reported with brain tumors and the rest as normal. The proposed CAD system achieves 99.67% accuracy with 100% sensitivity. Comparative studies on the same dataset show that the proposed CAD system outperforms several existing systems.
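For illustration, the local binary pattern features used in such pipelines can be computed in a few lines. This sketch covers only the feature-extraction half; the resulting per-block histograms would then be fed to the SVM classifier:

```python
def lbp_code(img, r, c):
    # 8-bit local binary pattern: compare the 8 neighbours of pixel (r, c)
    # to the centre value; each neighbour >= centre sets one bit.
    center = img[r][c]
    neighbours = [img[r-1][c-1], img[r-1][c], img[r-1][c+1], img[r][c+1],
                  img[r+1][c+1], img[r+1][c], img[r+1][c-1], img[r][c-1]]
    return sum((1 << k) for k, n in enumerate(neighbours) if n >= center)

def lbp_histogram(img):
    # 256-bin histogram of LBP codes over interior pixels:
    # the texture feature vector for one image block.
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist

img = [[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]]
print(lbp_code(img, 1, 1))  # → 120
```

The neighbour ordering (here clockwise from the top-left) is a convention; any fixed ordering yields an equally valid code.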
Kristina Preuer, Günter Klambauer and Sepp Hochreiter. Deep Learning for predicting synergy effects of drug combinations
Abstract: Drug combination therapies have numerous advantages over monotherapies and are commonly used for cancer treatment. If the right drugs are combined, it is possible to reduce the acquisition of drug resistance, lower the required doses, and achieve higher efficacy. However, the huge search space of possible combinations, as well as the lack of effective experimental procedures, makes synergy research extremely challenging. Nonetheless, understanding the interactions and effects of drug combinations is crucial for proper treatment.

We address this multi-modal data-source problem using a deep neural network that is capable of predicting the synergy of drug combinations on different cancer cell lines. The neural network combines structural compound information with cellular and genetic data of the cancer cell lines for its predictions. The model can therefore integrate both pharmaceutical and patient information, which is a major step towards personalized medicine. The model can be used to predict the synergy effect of a drug combination for a given cell line (patient), as well as to find synergistic drug combinations that work for a broad range of cell lines (patients).
Mariella Dimiccoli, Haoyi Xu and Petia Radeva. A cognitive-based model for event segmentation
Abstract: Perceptual information from unconstrained videos, and more generally from the world in which we act, arrives at our brain continuously over time through our visual system. In spite of that, instead of a continuous visual stream, we experience this visual information as a sequence of coherent and bounded perceptual units, called events in the cognitive literature [1]. Event representation is currently an active area of research in both computer vision and neuroscience. In computer vision, event representation is tightly coupled to event detection and recognition, which are crucial for automatically indexing and retrieving the growing number of videos available on the internet today. In neuroscience, the mechanisms underlying event representation and learning are still not well understood. Classically, surprise at the occurrence of an unpredicted observation has been considered the major cue for event segmentation. However, recent experimental findings [2] have shown that neural representations of events are not tied to predictive uncertainty, but arise from temporal community structures: items that share a temporal context are grouped together in a representational space.

Inspired by these findings, we investigate a new cognitively motivated model for event segmentation in photo-streams (2 fpm) captured by a wearable camera, which exploits this pattern of temporal overlap to divide an unconstrained stream of images into events. We first detect concepts in each image separately by employing a convolutional neural network and later, by leveraging WordNet, we cluster the detected concepts in a semantic space, hence defining a vocabulary of concepts. Each image is therefore represented by a concept vector, whose elements indicate the confidence with which each concept of the vocabulary is detected in the image. Next, relying on this semantic representation, we train a feed-forward neural network to predict which concept vector will occur in the next image. To simulate the concept vector sequence, we included a number of localist units equal to the number of concept vectors in both the input and output layers. The model modifies connection weights from the current item layer to learn to activate only the possible successors of a given concept vector. After training, we use the trained network as a fixed predictor and associate with each image the activation of the hidden layer for its corresponding concept vector. Finally, we use these activation vectors as a new image representation and apply a hierarchical merging algorithm to find clusters corresponding to events.

Our preliminary results indicate that the newly learned representational space is better suited than a semantic space for event segmentation. The proposed method is similarly accurate to [3], but it involves fewer parameters and, unlike [3], which also relies on contextual information, rests solely upon semantic features. Furthermore, while [3] uses a complex clustering method combining the advantages of agglomerative clustering and of a statistical change detector, the proposed method employs a simple hierarchical merging algorithm based on a binary partition tree representation of the photo-stream. Therefore, there is still room for improvement.
Gintare Karolina Dziugaite, Daniel Roy and Zoubin Ghahramani. Neural Network Matrix Factorization
Abstract: Data often comes in the form of an array or matrix. Matrix factorization techniques attempt to recover missing or corrupted entries by assuming that the matrix can be written as the product of two low-rank matrices. In other words, matrix factorization approximates the entries of the matrix by a simple, fixed function---namely, the inner product---acting on the latent feature vectors for the corresponding row and column. Here we consider replacing the inner product by an arbitrary function that we learn from the data at the same time as we learn the latent feature vectors. In particular, we replace the inner product by a multi-layer feed-forward neural network, and learn by alternating between optimizing the network for fixed latent features, and optimizing the latent features for a fixed network. We call the resulting approach "neural network matrix factorization" or NNMF, for short. We perform a number of experiments on synthetic datasets designed to identify the strengths and drawbacks of NNMF and other existing techniques for modelling relational data. On real datasets, NNMF dominates standard low-rank techniques on a suite of benchmarks. We also compare NNMF to two state-of-the-art techniques---LLORMA and AUTOREC---and demonstrate situations where each system dominates.
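A minimal sketch of the NNMF parameterization, with untrained random weights standing in for learned ones; the actual model alternates between optimizing the network and the latent features, and its architecture and sizes differ from these placeholder choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3          # latent dimension per row/column (illustrative)
h = 8          # hidden units (illustrative)

# Latent feature vectors for rows (users) and columns (items). In NNMF
# these are learned jointly with the network; here they are random.
U = rng.normal(size=(5, d))
V = rng.normal(size=(4, d))

# A one-hidden-layer network replaces the fixed inner product.
W1 = rng.normal(size=(2 * d, h))
b1 = np.zeros(h)
W2 = rng.normal(size=(h, 1))
b2 = np.zeros(1)

def nnmf_predict(u, i):
    # Feed the concatenated row/column features through the network.
    x = np.concatenate([U[u], V[i]])
    hidden = np.maximum(0.0, x @ W1 + b1)     # ReLU layer
    return float(hidden @ W2 + b2)

def mf_predict(u, i):
    # Classical matrix factorization: the fixed inner-product function.
    return float(U[u] @ V[i])

print(nnmf_predict(2, 1), mf_predict(2, 1))
```

The point of the sketch is the contrast between the two prediction functions: the learned network can represent entry-generating functions far beyond the inner product.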

See the full text at: https://arxiv.org/abs/1511.06443
Bowei Yan and Purnamrita Sarkar. Convex Relaxation for Community Detection with Covariates
Lydia Liu, Urun Dogan and Katja Hofmann. Decoding multitask DQN in the world of Minecraft
Abstract: Deep Q-Networks (DQN) have been hailed as a general-purpose deep reinforcement learning (RL) approach for training agents to solve tasks from raw visual input alone [1]. Recent papers use DQN and DQN-derived methods to train not only networks that attain superhuman performance in 2D Atari games, but also multitask networks that can play multiple such games at once [2,3]. A natural question is: were these multitask networks able to exploit the high-level similarity between games and learn common representations of their game states? Such a capacity for analogy and concept sharing across tasks is essential for multitask agents to accomplish life-long learning of a growing number of tasks efficiently.

However, these multitask networks all learned fully separable first-convolution-layer and last-hidden-layer representations for states from different games, showing no evidence of meaningful concept sharing between games [3]. One reason could be that the Atari games are too visually different to promote representation sharing during learning.

In our work, we show that with sufficiently related tasks, we can train a multitask extension of DQN (MDQN) to learn shared representations, while achieving expert performance in all the tasks it has been trained on. In particular, we construct such a set of tasks in Minecraft, a complex 3D world, and run experiments using the open-source Malmo platform [4]. 

Our contributions are three-fold: 

First, we introduce a novel set of 3D navigation tasks, 'Room Maze', designed for studying multitask learning of shared representations. The tasks vary in, e.g., the action space, while visual aspects are kept constant. We show that DQN is capable of achieving expert performance in these tasks.

Second, we investigate a multitask deep RL approach, MDQN, to understand whether multitask learning can be effective when tasks are sufficiently related. We demonstrate that multitask learning was indeed effective on the Room Maze tasks.

Third, we demonstrate a multitask agent that learns shared semantic representations of two tasks with different difficulty levels and state spaces. That is, it maps states from different tasks to the same points in representation space based on the conceptual similarity of the states, rather than on which task they come from. To our knowledge, this is the first such finding in the deep RL literature. To analyze the learned representations, we extend recent visualization ideas from deep supervised learning and deep RL. Our analysis provides interesting insights into the nature of multitask representations.


[1] V. Mnih, K. Kavukcuoglu, D. Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 02 2015. 

[2] E. Parisotto, L. J. Ba, and R. Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

[3] A. Rusu, S. Colmenarejo, Ç. Gülçehre, et al. Policy distillation. In International Conference on Learning Representations (ICLR), 2016.

[4] Matthew Johnson, Katja Hofmann, Tim Hutton, David Bignell. The Malmo Platform for Artificial Intelligence Experimentation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, (IJCAI), pages 4246-4247, 2016.
Jessica Hamrick and Thomas Griffiths. Metareasoning and mental simulation
Abstract: Any agent---whether human or artificial---must be able to competently use and allocate its computational resources in order to effectively navigate the world. How should an agent perform this allocation? Specifically, on what tasks should an agent use which resources? What should the agent think about to perform the task? How much time should an agent spend thinking, rather than acting? To gain insight into the answers to these questions, we asked: how do people make the best use of their cognitive resources? We focus in particular on the domain of "mental simulation", which is analogous to the idea of a rich, generative model; this type of computation is particularly interesting in that the amount of time spent computing determines the quality of the answer. Through a series of behavioral experiments combined with machine learning models, we show how people adaptively use their "mental simulations" to learn new things about the world, that they choose which simulations to run based on which they think will be more informative, and that they allocate their cognitive resources to spend less time on easy problems and more time on hard problems.

To investigate how people learn new things from mental simulation, we asked how people learn about physical object properties such as mass. We ran a series of experiments that demonstrate people's capacity to reason about the relative masses of objects, and explain our results through a model that combines Bayesian inference, approximate knowledge of Newtonian physics, and Monte Carlo estimates of probabilities from noisy physical simulations. We find that this model predicts judgments from our experiments, suggesting that the same simulation mechanism underlies both people's predictions and inferences. [1]

To understand which simulations people run, we focused on the classic task of "mental rotation", in which people imagine a pair of objects under different rotations in order to determine whether or not they are the same object. We formulated a model of mental rotation based on Bayesian quadrature to show that people choose to imagine the rotations that will maximally decrease the variance in their estimate of the answer. [2]

Finally, to explore the question of how many simulations people run, we designed a task in which people had to judge whether a ball bouncing in a box would go through a hole or not. We combined a model of noisy physical simulation with a decision making strategy called the sequential probability ratio test (SPRT), which predicted that people should use more samples when it is harder to make an accurate prediction due to higher simulation uncertainty. Our experimental results were well-predicted by our model, demonstrating that people have a systematic strategy to allocate resources for mental simulation. [3]
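The SPRT stopping rule can be sketched as follows: noisy simulation outcomes are accumulated into a log-likelihood ratio until it crosses a decision threshold, so more uncertain trials consume more samples. The specific hypotheses and error rates below are illustrative, not the paper's fitted values:

```python
import math, random

random.seed(1)

def sprt_num_samples(p_true, alpha=0.05, beta=0.05, p0=0.4, p1=0.6):
    # Draw noisy simulation outcomes (1 = "ball goes through the hole")
    # until the log-likelihood ratio for H1: p = p1 versus H0: p = p0
    # crosses one of the Wald thresholds. Returns (samples used, decision).
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))
    llr, n = 0.0, 0
    while lower < llr < upper:
        n += 1
        outcome = 1 if random.random() < p_true else 0
        llr += math.log(p1 / p0) if outcome else math.log((1 - p1) / (1 - p0))
    return n, llr >= upper

# An easy trial (outcome nearly certain) versus a hard, uncertain one:
easy = sum(sprt_num_samples(0.9)[0] for _ in range(200)) / 200
hard = sum(sprt_num_samples(0.55)[0] for _ in range(200)) / 200
print(easy < hard)
```

This reproduces the qualitative prediction tested in the paper: sample counts scale with outcome uncertainty, not with a fixed simulation budget.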

[1] Hamrick, Battaglia, Griffiths, & Tenenbaum (2016). Inferring Mass in Complex Scenes by Mental Simulation. Cognition.
[2] Hamrick & Griffiths (2014). What to simulate? Inferring the right direction for mental rotation. CogSci 2014.
[3] Hamrick, Smith, Griffiths, & Vul (2015). Think again? The amount of mental simulation tracks uncertainty in the outcome. CogSci 2015.
Manasi Vartak. ModelDB: A System for Machine Learning Model Management
Abstract: Building a machine learning model is an iterative process. A data scientist will build many tens to hundreds of models before arriving at one that meets some acceptance criteria (e.g. AUC cutoff, accuracy threshold). However, the current style of model building is ad-hoc and there is no practical way for a data scientist to manage models that are built over time. This gives rise to three types of problems: (1) Reproducing models and results is excessively time-consuming or at times infeasible; (2) Data scientists must “remember” results and parameters of previous versions of a model; (3) Data scientists have no means of answering aggregate or example-level questions regarding different versions of a model. These challenges resulting from the iterative nature of modeling highlight an important and little-studied problem for machine learning tools: model management. Model management is the problem of tracking, storing, and indexing large numbers of machine learning models so that they may subsequently be reproduced, shared, queried and analyzed. I will present ongoing work on ModelDB, an open-source system to manage machine learning models. ModelDB automatically tracks models in their native environments (specifically spark.ml and scikit-learn), indexes them intelligently, and allows flexible exploration of models via SQL as well as a visual interface.
Huihui Fan, Finale Doshi-Velez and Luke Miratrix. Promoting Domain-Specific Terms in Topic Models with Informative Priors
Abstract: Latent Dirichlet Allocation (LDA) models trained without stopword removal often produce topics with high posterior probabilities on uninformative words, obscuring the underlying corpus content. Even when canonical stopwords are manually removed, uninformative words common in that corpus will still dominate the most probable words in a topic. We propose a simple strategy for automatically promoting terms with domain relevance and demoting these domain-specific stopwords, combining two ideas: (1) giving different priors to different topics to encourage separation between stopwords and domain content, and (2) using asymmetric priors to penalize likely stopwords in domain-targeting topics. Our approach is easily applied within any existing LDA framework; it increases the amount of domain-relevant content and reduces the appearance of canonical and human-evaluated stopwords in three very different domains: Department of Labor accident reports, online health forum posts, and NIPS abstracts. Our methods at least tripled the number of domain-relevant words captured and reduced the appearance of canonical stopwords by over 80%. Along the way, we show that standard topic quality measures such as coherence and pointwise mutual information act counter-intuitively in the presence of common but irrelevant words. We also explain why these standard metrics fall short, propose an additional topic quality metric that targets the stopword problem, and show that it correlates with our human subject evaluations.
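The asymmetric-prior construction can be sketched in a few lines; the vocabulary, prior values, and number of stopword topics below are illustrative, not the paper's settings. Stopword-like words receive a high prior in one or more designated stopword topics and a penalized prior in domain-content topics:

```python
import numpy as np

V = ["the", "of", "claimant", "injury", "ladder", "scaffold"]
likely_stopwords = {"the", "of"}      # canonical or corpus-frequent words
K, S = 4, 1                           # K topics, the first S reserved for stopwords

# Topic-word prior matrix eta (one row per topic). Idea (1): topics get
# different priors; idea (2): priors are asymmetric across the vocabulary.
eta = np.full((K, len(V)), 0.01)      # symmetric base prior
for j, w in enumerate(V):
    if w in likely_stopwords:
        eta[:S, j] = 1.0              # attract stopwords into topic 0
        eta[S:, j] = 0.001            # demote them in domain-content topics

print(eta[0].tolist())                # stopword topic: high prior on "the", "of"
print(eta[1].tolist())                # content topic: those words penalized
```

Such an eta matrix can be passed to any LDA implementation that accepts per-topic, per-word Dirichlet priors in place of a single symmetric hyperparameter.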
Jessica Verena Schulze. Functionally Informed Priors in a Bayesian Machine Learning Approach to Neuronal Connectivity Inference
Abstract: The recent advancement of two-photon microscopy enables the recording of several hundred up to several thousand neurons simultaneously. This presents us with a newfound wealth of data, for which manual methods of analysis are becoming increasingly inefficient. In this study we aim to contribute to the development of computational methods for the analysis of neural circuits and information processing from large-scale neural recording data, specifically the problem of neural connectivity inference. Neural connectivity inference is a major challenge in neuroscience, and advances in this area will facilitate the study of information processing in the brain. We propose to incorporate functionally informed priors, which reflect the function the neural network implements, into a Bayesian approach to this task.

Previous work by Mishchenko in 2011 suggests that the use of functionally informed priors has significant benefit. Such previous studies used rather simple priors, such as sparseness of the neural connections. We base our method on this work and follow its example of using a Bayesian approach, in which we encode the neural network's dynamics in a Hidden Markov Model whose unknown parameters are approximated using an Expectation Maximisation algorithm with Sequential Monte Carlo as the sampling method. Our approach's novelty lies in the investigation of more specific, biologically informed priors; one specific example is the use of distance measures across neurons.

To develop our method for inference of neural connectivity, we are working with two-photon recordings of layers 2, 3, and 5 of the parietal and secondary visual cortex of ten mice undergoing a virtual navigation task. This area of the brain is known to be involved in the integration of sensory and motor input and to represent egocentric information about position on a map, and thus gives us significant information about its biological function.

The novelty of this study lies in its rigorous incorporation and analysis of more specific, biologically informed priors, as well as in its application to an extensive in-vivo dataset on a larger scale than has previously been achieved with such methods.

**This is an ongoing study and final results have not yet been obtained.**
Zita Marinho, Shay B. Cohen, Andre F. T. Martins and Noah A. Smith. Semi-Supervised Learning of Sequence Models with the Method of Moments
Abstract: We propose a moment-matching method for semi-supervised learning of sequence models. Spectral learning and moment-matching approaches have recently proved a viable alternative to expectation-maximization (EM) methods for unsupervised learning, supervised learning with latent variables and topic modeling. These methods have learnability guarantees, they do not suffer from local optima, and are computationally less demanding [1].
Unlike spectral methods, our approach does not require an orthogonal decomposition of any matrix or tensor. Instead, we consider a more restricted form of supervision: words that have unambiguous annotations, denoted as anchor words [2]. We extract them from a small labeled dataset or a dictionary. Given the anchor words, the estimation of the model parameters can be made efficiently by collecting moment statistics from the unlabeled data, and solving a small quadratic program for each word.
Our contributions are threefold: (1) we adapt anchor methods to semi-supervised learning of generative sequence models; (2) we show how our method can also handle log-linear feature-based emissions; (3) we apply this model to POS tagging. We experiment on a Twitter dataset and on a Malagasy dataset (a low-resource language), and we show that our method does particularly well with very little labeled data, outperforming other semi-supervised methods, such as EM and self-training.

We propose a fast and scalable method for semi-supervised learning of sequence models, based on anchor words and moment matching. Our method can handle hidden Markov models with feature-based log-linear emissions. Unlike other semi-supervised methods, no decoding passes are necessary on the unlabeled data and no graph needs to be constructed; only one pass is necessary to collect moment statistics. The model parameters are estimated by solving a small quadratic program for each feature. Experiments on part-of-speech (POS) tagging for Twitter and for a low-resource language (Malagasy) show that our method can learn from very few annotated sentences.
Note: This work will be presented at EMNLP 2015 as a full research paper.

[1] Daniel Hsu, Sham M. Kakade, and Tong Zhang. 2012. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480.
[2] Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, David Sontag, Ankur Moitra, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. In JMLR Proceedings, ICML.
Diane Bouchacourt, M. Pawan Kumar and Sebastian Nowozin. DISCO Nets : DISsimilarity COefficient Networks
Abstract: We are interested in the class of problems that require the prediction of a structured output y given an input x. Complex applications often involve large uncertainty about the correct value of y. Consider for example the task of hand pose estimation from depth images, where one wants to accurately estimate the pose y of a hand given a depth image x. The depth image often presents occlusions and missing depth values. This results in uncertainty about the pose of the hand, and multiple poses can be relevant for a given image. It is, therefore, natural to use probabilistic models that are capable of representing the distribution of the output for a given input.

Furthermore, in most cases the capacity of the model is restricted and cannot capture the true distribution perfectly. In this case, the choice of the learning objective influences the final performance. Similar to [3], we argue that the learning objective should be tailored to the loss employed at test time to obtain the best performance with respect to this loss.

Our project presents a new type of probabilistic model which we call DISsimilarity COefficient Networks (DISCO Nets). DISCO Nets allow us to efficiently sample from the posterior distribution of the output, parametrised by a neural network. DISCO Nets employ a prediction and training procedure based on the minimisation of a dissimilarity coefficient. This coefficient is based on the expected value of the loss between samples drawn from the estimated distribution and the true distribution. Theoretically, this ensures that DISCO Nets accurately capture uncertainty on the correct output to predict given an input. Furthermore, this coefficient is defined for any non-negative symmetric loss function, allowing us to easily tailor our model to the task at hand.
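The dissimilarity coefficient can be estimated from samples alone, which is what makes training tractable. Here is a Monte Carlo sketch following the general form of such an objective (the expected cross-loss corrected by within-distribution "diversity" terms); it is an illustration, not the authors' exact unbiased estimator:

```python
import random

random.seed(0)

def expected_loss(xs, ys, loss):
    # Monte Carlo estimate of E[loss(X, Y)] over pairs of samples.
    return sum(loss(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def disco_estimate(model_samples, true_samples, loss, gamma=0.5):
    # Dissimilarity coefficient sketch: cross-loss between the estimated
    # and true distributions, minus a weighted combination of each
    # distribution's internal dispersion under the same loss.
    cross = expected_loss(model_samples, true_samples, loss)
    within_model = expected_loss(model_samples, model_samples, loss)
    within_true = expected_loss(true_samples, true_samples, loss)
    return cross - gamma * within_model - (1 - gamma) * within_true

l2 = lambda a, b: (a - b) ** 2
good = [random.gauss(0.0, 1.0) for _ in range(200)]   # matches the target
bad = [random.gauss(3.0, 1.0) for _ in range(200)]    # biased model
target = [random.gauss(0.0, 1.0) for _ in range(200)]
print(disco_estimate(good, target, l2) < disco_estimate(bad, target, l2))
```

Because the coefficient is defined for any non-negative symmetric loss, swapping `l2` for a task-specific loss (e.g., a hand-pose error) tailors the same objective to the application.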

Experimental results on the task of hand pose estimation consistently support our theoretical hypothesis as DISCO Nets outperform non-probabilistic equivalent models, and existing probabilistic models. Our work [1] will be presented as a poster during the NIPS 2016 conference.

We have so far considered only continuous output spaces, and employ gradient descent during training. However, when the output space is discrete, the gradient of the expected value of the loss can no longer be computed as the expected value of the gradient. A common technique in this case is to employ a surrogate loss function. However, we expect to achieve better performance by directly minimising the expected value of the loss, as in the continuous case. Therefore, as future work, we will extend DISCO Nets to the case of a discrete output space using a direct loss minimisation method similar to [2].

[1] D. Bouchacourt, M. P. Kumar and S. Nowozin, “DISCO Nets : DISsimilarity COefficients Networks”, In NIPS 2016.
[2] T. Hazan, J. Keshet and D. A. McAllester, “Direct Loss Minimization for Structured Predictions”, In NIPS 2010.

[3] S. Lacoste-Julien, F. Huszar, and Z. Ghahramani. “Approximate Inference for the Loss-calibrated Bayesian”. In AISTATS, 2011.
Alejandrina Cristia and Emmanuel Dupoux. Learnability Differences in Child- versus Adult-Directed Speech: The Case of Unsupervised Pattern Discovery
Abstract: Human cognition has long provided challenges for machine learning, but developments in recent years have sometimes led to systems that can achieve super-human performance (e.g., face recognition, the game of Go). In one domain of cognition, nonetheless, humans remain simply unsurpassable: native language acquisition. This is a task that is trivially easy for children, who, without apparent external supervision, routinely acquire one or more languages by the age of 4 years, after exposure to at most 8k hours of speech. In stark contrast, the best speech recognition and natural language processing systems on the market today require up to 100 times that quantity of input to achieve a level of performance that is substantially lower than that of humans.
Our research program seeks to use methods from applied engineering to investigate both the learning mechanisms and the linguistic experiences that may be required to solve this task. A first step in this quest has been to investigate whether qualitative differences in the kind of input that humans and machines are receiving today could in part account for differences in asymptotic learning. Specifically, machines today are mostly trained with spoken and written corpora that are very much unlike the child-directed speech that infants are thought to be typically exposed to, and it is conceivable that child-directed language has properties that make extraction of linguistic structure easier. In two studies, we investigated one important subtask, the unsupervised discovery of recurring patterns (words), assessing whether systems trained with child-directed speech outperform those trained with adult-directed speech.
In a first study, we applied an acoustic motif discovery algorithm to recordings of speech in which parents described objects to their young infants or to a fellow adult [1]. By and large, performance was comparable across child- and adult-directed speech, although there were small but significant differences depending on the way one parameter was set. In line with the expectation of better performance for child- than adult-directed speech, cluster purity was higher in the former than the latter; on the other hand, both recall and cluster collocation were lower. These results may be due to the fact that parents produce more acoustically variable speech when addressing their children, a tendency that will negatively impact pattern detection using raw acoustics as input.
Therefore, a second study investigated two child- and adult-directed language databases that have been orthographically transcribed by humans, where speech could thus be represented in a format that is more robust to such pronunciation differences [2]. Six unsupervised pattern detection algorithms, including both state-of-the-art systems and others that are common in the cognitive literature, were applied to a pronunciation-like format derived using text-to-speech technology. For nearly all combinations of database and algorithm, performance was higher for child- than adult-directed speech. We discuss the implications of these results for the quest to build computational systems that achieve human-like performance.
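As a toy illustration of the kind of recurring-pattern (word) discovery evaluated, a transitional-probability segmenter over transcribed syllables, in the spirit of the cognitive-literature segmenters (this is a stand-in sketch, not one of the six systems studied):

```python
from collections import Counter

def segment_by_transition_prob(utterances, threshold=0.5):
    """Posit a word boundary wherever the syllable-to-syllable transitional
    probability P(next | current) drops below a threshold."""
    seqs = [u.split() for u in utterances]  # whitespace-delimited "syllables"
    uni, bi = Counter(), Counter()
    for s in seqs:
        uni.update(s)
        bi.update(zip(s, s[1:]))
    words = []
    for s in seqs:
        w = [s[0]]
        for a, b in zip(s, s[1:]):
            tp = bi[(a, b)] / uni[a]       # estimated P(b | a)
            if tp < threshold:             # low predictability => boundary
                words.append("".join(w))
                w = []
            w.append(b)
        words.append("".join(w))
    return words

# "ba by" recurs often, so its internal transition is highly predictable
# and it is recovered as the word "baby".
utts = ["ba by ba by dog", "ba by cat", "dog ba by"]
words = segment_by_transition_prob(utts)
```

A segmenter like this can then be scored separately on child- and adult-directed transcripts to compare recovered word tokens against the human transcription.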

[1] Ludusan, B., et al. (2015). Motif discovery in infant- and adult-directed speech. EMNLP, 93-102.
[2] Cristia, A. et al. (in preparation). Word segmentation from children's input.
Agnieszka Słowik. Analysis of projections defining probability distributions in random neural networks
Abstract: As a result of constant improvements in data retrieval and processing, finding a large dataset is no longer the main problem. However, reducing the computational complexity of the learning phase remains a challenge that is not solved by recent developments in Deep Learning. One possible approach is to reduce the number of trained parameters in neural networks by introducing random projections.

My thesis analyses the impact of different probability distributions on the effectiveness of randomized neural networks. I show that the choice of distribution matters and, in the worst case, may even make a model impossible to train. The thesis compares three types of random projections (data-independent, semi-supervised and supervised) and examines their effect on Extreme Learning Machines (ELM), the randomized neural network model introduced by Guang-Bin Huang. I use 10 open-source datasets (for both regression and classification problems) from libsvm and two activation functions common for ELM (sigmoid and Gaussian). My next step will be to compare the models' performance on a practical problem with real data, such as compound activity prediction in drug design (where ELM have been shown to outperform SVM) or in robotics and control.
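A minimal sketch of the ELM recipe under comparison, assuming a data-independent Gaussian projection (all names are illustrative; the thesis varies precisely the distribution from which W is drawn):

```python
import numpy as np

def elm_fit(X, y, n_hidden=100, rng=None):
    """Extreme Learning Machine: random, untrained hidden layer plus a
    least-squares readout. Only beta is learned."""
    rng = np.random.default_rng(rng)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random projection weights
    b = rng.normal(size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))        # sigmoid activations
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # closed-form readout
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy regression: swapping the distribution used for W (and b) is the
# experimental manipulation under study.
X = np.linspace(0, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0])
W, b, beta = elm_fit(X, y, rng=0)
pred = elm_predict(X, W, b, beta)
```

Because the hidden weights are never trained, the learning phase reduces to a single least-squares solve, which is the computational saving motivating the approach.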
Christina Heinze-Deml, Brian McWilliams and Nicolai Meinshausen. Preserving Differential Privacy Between Features in Distributed Estimation
Abstract: Data-driven personalization -- from user experience on the web to medicine and healthcare -- relies on aggregating a large amount of potentially sensitive data relating to individuals from disparate sources in order to answer statistical queries. Understandably, from a privacy perspective it may be undesirable -- or even impossible -- for such data to be shared in an undisguised form, making it difficult to learn from datasets that are partitioned between multiple data owners. Such distributed settings arise when different parties contribute different records to a dataset (the design matrix is "horizontally-partitioned"), or when information about each record in the dataset is held by different data owners (the design matrix is "vertically-partitioned").

The differential privacy (DP) framework guarantees that anonymity is preserved in a large dataset, providing a theoretically motivated alternative to current best practices and legal guidelines. However, in the distributed setting, very few approaches exist for private data sharing for the purposes of statistical estimation, and the classical differential-privacy setup with a 'trusted curator' preparing the data does not apply.

To this end, we introduce (epsilon, delta, S)-differential privacy, which extends single-party differential privacy to the distributed, vertically-partitioned case. Furthermore, we propose PriDE, a scalable framework for distributed estimation in which each party communicates perturbed sketches of its locally held features, ensuring (epsilon, delta, S)-differentially private data sharing. For L2-penalized supervised learning problems, PriDE has bounded estimation error compared with the optimal estimates obtained without privacy constraints in the non-distributed setting.
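Conceptually, a party's message might be formed as below: a random sketch of its locally held feature columns, perturbed with noise before sharing. This is a sketch of the idea only; the actual calibration of the noise to the (epsilon, delta, S) guarantee is specified in the paper.

```python
import numpy as np

def perturbed_sketch(X_local, sketch_dim, noise_scale, rng=None):
    """One party's communicated message: a random projection of its local
    feature columns plus Gaussian perturbation. Illustrative only; the
    noise_scale here is NOT calibrated to any formal privacy guarantee."""
    rng = np.random.default_rng(rng)
    n = X_local.shape[0]
    S = rng.normal(size=(sketch_dim, n)) / np.sqrt(sketch_dim)  # sketching matrix
    noise = rng.normal(scale=noise_scale, size=(sketch_dim, X_local.shape[1]))
    return S @ X_local + noise

# Vertically partitioned design: party A holds columns 0-2, party B holds 3-4.
X = np.random.default_rng(1).normal(size=(200, 5))
msg_A = perturbed_sketch(X[:, :3], sketch_dim=50, noise_scale=0.1, rng=2)
msg_B = perturbed_sketch(X[:, 3:], sketch_dim=50, noise_scale=0.1, rng=3)
```

Each party would then use the other parties' sketches to adjust its local estimate for non-local confounders, as described below.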

Thus PriDE allows statistical queries to be posed that depend on all of the features relating to the subjects in a vertically-partitioned dataset. Each party can then estimate the portion of the true underlying parameter vector corresponding to the features it holds, while accounting for the contribution of the features held by the remaining parties. This is particularly relevant whenever there are confounding variables that are non-local to party k but that influence both the features held by party k and the response Y. In such a setting, the estimates would be biased if the model were learnt without adjusting for the confounders.

We confirm the performance of PriDE empirically on real world and synthetic datasets. In our experiments, PriDE improves upon a fully-private baseline which avoids communicating features between parties and quickly approaches the performance of the optimal solution estimated without respecting privacy in the non-distributed setting.
Anastasia Ushakova and Slava Mikhaylov. Predicting energy customer vulnerability using smart meter data: a survey of methods and results.
Abstract: In this research, we assessed various machine learning techniques for the classification of smart meter energy consumption data recorded at half-hourly intervals for households in the UK. The main motivation for our research was to provide evidence of how big data may be used in the energy sector to inform the decision-making process in the evaluation and development of public policies.

Our data had two classes based on the financial vulnerability of the customer. Non-vulnerable customers were denoted with a 0, while vulnerable customers were denoted with a 1. However, it should be noted that the data were biased towards the non-vulnerable class with a ratio of 75:25. As an initial step, we clustered the temporal profiles of daily consumption to identify homogeneous groups associated with each class. Subsequently, we analysed how customer consumption behaviour can be used for the prediction of household vulnerability.

Neural networks, random forests, naive Bayes and support vector machines were evaluated for the classification of consumer vulnerability based on half-hourly energy consumption records. Having previously identified class imbalance in the data, we employed various techniques to address the issue. A positive effect was observed in the case of the random forest model, with accuracy increasing from 81% to 95% due to the rebalancing.
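One such rebalancing technique, random oversampling of the minority class, can be sketched as follows (illustrative; the abstract does not name the exact techniques used):

```python
import numpy as np

def oversample_minority(X, y, rng=None):
    """Rebalance a binary dataset (e.g. the 75:25 split above) by resampling
    the minority class with replacement until both classes match in size."""
    rng = np.random.default_rng(rng)
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    minority, majority = (idx1, idx0) if len(idx1) < len(idx0) else (idx0, idx1)
    # Draw extra minority samples (with replacement) to close the gap.
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    return X[keep], y[keep]

# 75 non-vulnerable (0) vs 25 vulnerable (1) customers, as in the study's ratio.
y = np.array([0] * 75 + [1] * 25)
X = np.arange(100).reshape(-1, 1)
Xb, yb = oversample_minority(X, y, rng=0)
```

After rebalancing, both classes contribute equally to training, which is one plausible mechanism behind the reported accuracy gain for the random forest.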

Random forests were found to outperform all other models significantly, with an observed accuracy of 95% versus the 60% average for the other models. This improvement may be attributed to the high variation in half-hourly energy loads. In addition, the heterogeneous levels of interdependency associated with half-hourly energy consumption may also arise from the idiosyncratic usage of natural gas at the household level. In conclusion, we provide suggestions for further research that may improve the other models' performance at predicting customer flags from their behaviour. We also encourage other researchers to exploit the data provided by smart meters, given its significant potential for the development and modification of current classification techniques.
Imaculate Mosha. Dynamic Convex Hull Self Organising Map for the Travelling Salesman Problem.
Abstract: The Travelling Salesman Problem (TSP) is a combinatorial optimization problem: given a list of points, find the optimal tour, i.e. the shortest closed tour passing through each point exactly once. It is an NP-complete problem [3], which motivates heuristic solutions.

Several Machine Learning techniques have been applied to solve it, including Hopfield-type neural networks, the Elastic Net, Genetic Algorithms and the Self Organising Map [6]. The Self Organising Map (SOM) is an unsupervised learning algorithm which establishes a topological relationship among input data. This is achieved by inspecting data values and adjusting its neurons' synaptic weights to fit them, through competitive selection of a winner neuron for each input signal and cooperative adaptation of the winner and its neighbours. The Convex Hull SOM [1] (CHSOM) enforces the convex hull property of the optimal tour by initialising the neurons with the convex hull vertices of the input cities. It has produced good results on various graphs of TSPLIB [4]. Its drawbacks are that many parameters have to be predetermined and the length of the resulting tour is suboptimal.
Drawing upon techniques applied in [2], I propose the Dynamic Convex Hull SOM (DCHSOM), an algorithm that extends CHSOM in the following ways:
Adaptations are applied to all winner neurons, irrespective of whether they are initialising neurons.
Some parameters, such as the number of iterations and the width of the neighbourhood, are determined dynamically.
When neurons are created, their weights are initialised to the average of their two neighbours.
Upon completion, neurons are mapped to points by a novel matching technique.

The result is an algorithm that is fast and dynamic, suitable for large inputs, and that produces tours within 10% of the optimal length. Experimental results on TSPLIB [4] data show that DCHSOM achieves considerable improvements over the other neural network algorithms presented in [5].
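To ground the SOM family that DCHSOM extends, a bare-bones ring SOM for TSP might look as follows (a toy sketch with assumed parameter schedules, not the proposed DCHSOM and without the convex hull initialisation):

```python
import numpy as np

def som_tsp(points, n_iter=2000, rng=None):
    """Neurons on a ring chase city points; the ring order becomes the tour."""
    rng = np.random.default_rng(rng)
    n = len(points) * 2                                  # more neurons than cities
    neurons = points[rng.integers(0, len(points), n)] + rng.normal(0, 1e-3, (n, 2))
    for t in range(n_iter):
        lr = 0.8 * (1 - t / n_iter)                      # decaying learning rate
        radius = max(1, int(n / 8 * (1 - t / n_iter)))   # shrinking neighbourhood
        city = points[rng.integers(len(points))]
        winner = np.argmin(np.linalg.norm(neurons - city, axis=1))
        for d in range(-radius, radius + 1):
            j = (winner + d) % n                         # ring topology wraps around
            influence = np.exp(-(d * d) / (2 * (radius / 3 + 1e-9) ** 2))
            neurons[j] += lr * influence * (city - neurons[j])
    # Map each city to its nearest neuron; neuron order along the ring = tour.
    order = np.argsort([np.argmin(np.linalg.norm(neurons - p, axis=1))
                        for p in points])
    return order

pts = np.random.default_rng(0).uniform(size=(10, 2))
tour = som_tsp(pts, rng=0)
```

The parameters fixed by hand here (iteration count, neighbourhood width, neuron-to-point matching) are exactly the pieces DCHSOM determines dynamically or replaces.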

[1] Xu, X., Jia, Z., Ma, J., Wang, J. A Self-organizing Map Algorithm for the Traveling Salesman Problem. Fourth International Conference on Natural Computation, 2008.

[2] Bai, Y., Zhang, W. An Efficient Growing Ring SOM and Its Application to TSP. Proceedings of the 9th WSEAS International Conference on Applied Mathematics, Istanbul, Turkey, May 27-29, 2006.

[3] Papadimitriou, C. H. The Euclidean traveling salesman problem is NP-complete. Theoretical Computer Science, Vol. 4, 1978, pp. 237-244.

[4] Reinelt, G. TSPLIB - A Traveling Salesman Problem Library. ORSA Journal on Computing, 3(4), 1991, pp. 376-384.

[5] Maire, B., Mladenov, V. Comparison of Neural Networks for solving the Travelling Salesman Problem. 11th Symposium on Neural Network Applications in Electrical Engineering, 2012.

[6] Kohonen, T. Self-organized formation of topologically correct feature maps. Biological Cybernetics, Vol. 43, No. 2, 1982, pp. 59-69.
Baharan Mirzasoleiman, Morteza Zadimoghaddam and Amin Karbasi. Fast Distributed Submodular Cover: Public-Private Data Summarization
Abstract: In this paper, we introduce the public-private framework of data summarization, motivated by privacy concerns in personalized recommender systems and online social services. Such systems usually have access to massive data generated by a large pool of users. A major fraction of the data is public and is visible to (and can be used for) all users. However, each user can also contribute some private data that should not be shared with other users, to ensure her privacy. The goal is to provide a succinct summary of a massive dataset, ideally as small as possible, that is customized to each user: a summary can contain elements from the public data (for diversity) and the user's private data (for personalization). To formalize the above challenge, we assume that the scoring function according to which a user evaluates the utility of her summary satisfies submodularity, a widely used notion in data summarization applications. This property (informally) states that the added value of a new element to a smaller summary is more than its added value to a larger one. Thus, we model the data summarization targeted to each user as an instance of a submodular cover problem. However, when the data is massive it is infeasible to use the centralized (and sequential in nature) greedy algorithm to find a customized summary even for a single user. Moreover, for a large pool of users, it is too time consuming to find such summaries separately. Instead, we develop a fast distributed algorithm for submodular cover, FASTCOVER, that provides a succinct summary in one shot and for all users. We provide strong theoretical guarantees on the performance of FASTCOVER in terms of the solution it returns and the number of rounds it takes. More specifically, we show that the solution provided by FASTCOVER is competitive with that of the centralized algorithm while requiring a number of rounds exponentially smaller than state-of-the-art results.
Moreover, we have implemented FASTCOVER on Spark to demonstrate its practical performance on a number of concrete applications, including personalized location recommendation, personalized movie recommendation, and vertex cover on tens of millions of data points and varying number of users. On problem instances where we can run the centralized greedy algorithm, we usually observe that FASTCOVER provides smaller solutions in much less time.
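For contrast with the distributed algorithm, the centralized greedy baseline for submodular cover can be sketched with coverage (set cover) as the submodular utility; this is an illustrative toy, not FASTCOVER itself:

```python
def greedy_submodular_cover(universe, sets, target=None):
    """Classical sequential greedy for submodular cover: repeatedly add the
    element with the largest marginal coverage gain until the target is
    covered. This is the baseline the paper argues is infeasible at scale."""
    target = set(universe) if target is None else set(target)
    covered, summary = set(), []
    while covered < target:  # strict subset: stop once target is covered
        best = max(sets, key=lambda s: len((sets[s] & target) - covered))
        gain = len((sets[best] & target) - covered)
        if gain == 0:
            break            # target cannot be covered by the available sets
        summary.append(best)
        covered |= sets[best] & target
    return summary

# Toy instance: four candidate summary elements, each "covering" some items.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}
summary = greedy_submodular_cover(range(1, 7), sets)
```

Each greedy step requires a full pass over all candidates after every selection, which is the sequential bottleneck a one-shot distributed algorithm avoids.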
Eleni Triantafillou, Jamie Ryan Kiros, Raquel Urtasun and Richard Zemel. Towards Generalizable Sentence Embeddings
Abstract: Natural language is an integral part of numerous applications, such as web search and information retrieval. Therefore, constructing meaningful language representations that can inform these tasks is an important problem. In addition, despite having well-established methods to construct word representations, it remains an open problem to capture the semantics of larger pieces of text in a vector that is useful for different tasks with minimal adaptation.

This process can be thought of as 'embedding' sentences into a high-dimensional space, so that 'related' sentences are mapped to neighbouring points. There are, however, many notions of closeness that may be desirably reflected in the embeddings: sentences could be considered similar if they are likely to be found in the same context, if the second is entailed by the first, if they are paraphrases of each other, etc. We hypothesize that an embedding space that adheres to several of these notions can host more generalizable vectors: informative for different tasks, and requiring less fine-tuning.

To test this hypothesis we introduce a sentence encoder, SNLI-Skipthoughts, that is learned by injecting supervised information from the Stanford Natural Language Inference (SNLI) dataset into the commonly used Skipthought embeddings [1]. SNLI labels each of 570K pairs of sentences as 'entailment', 'contradiction', or 'neutral'. The aim is to enhance Skipthoughts with an additional notion of similarity through this supervision.

We compare SNLI-Skipthoughts with Skipthoughts and with other sentence encoders, both in a supervised and an unsupervised fashion. In the former, a classifier is trained on the embeddings for each evaluation task: paraphrase detection, semantic relatedness, natural language inference and various classification benchmarks. However, we focus more on unsupervised evaluation, since we believe it better reveals the properties of the embedding spaces. By using cosine similarity directly as a proxy for relatedness, we evaluate the embeddings through semantic relatedness and paraphrase ranking, correlation of the learned similarity notion with human judgements, paraphrase detection with little or no training, and examination of embedding statistics.
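The unsupervised protocol reduces to ranking by cosine similarity in the embedding space, with no classifier on top; a sketch, with random vectors standing in for actual sentence embeddings:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_by_relatedness(query_vec, candidate_vecs):
    """Rank candidates by cosine similarity to the query embedding,
    most related first -- cosine used directly as a relatedness proxy."""
    sims = [cosine(query_vec, c) for c in candidate_vecs]
    return np.argsort(sims)[::-1]

rng = np.random.default_rng(0)
query = rng.normal(size=8)
cands = [query + rng.normal(scale=0.1, size=8),  # near-paraphrase stand-in
         rng.normal(size=8),                      # unrelated sentence stand-in
         -query]                                  # "contradiction"-like stand-in
order = rank_by_relatedness(query, cands)
```

Under the hypothesis above, a more generalizable space should rank paraphrases above unrelated sentences, and contradictions below, without any task-specific training.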

Our results support the hypothesis that incorporating SNLI information into Skipthoughts induces a more 'generalizable' space. Skipthoughts are known to perform remarkably well in the supervised evaluation setting. We show that adding the SNLI information does not harm their performance, while in some cases slightly improving it. In the unsupervised setting, we show that SNLI-Skipthoughts outperform Skipthoughts on all tasks considered. We believe this indicates a higher-quality embedding space. Further, inspecting the distribution of pairwise cosine similarities of sentences from a paraphrase detection dataset is revealing: in Skipthought space, the mean similarity is higher and the variance lower than in SNLI-Skipthought space. We conjecture this can be attributed to the additional notion of relatedness captured by SNLI-Skipthoughts, which pushes contradictory sentences further away from each other in order to keep sentences that entail each other close.

The full text is available at [2].

[1] Kiros, R. et al. Skip-thought vectors. In Advances in neural information processing systems 2015.
[2] Triantafillou, E. et al. Towards Generalizable Sentence Embeddings. RepL4NLP in ACL 2016.
Barbara Plank. The side benefit of behavior: using keystroke dynamics to inform Natural Language Processing
Abstract: When people produce or read texts, they generate large amounts of by-product in the form of behavioral data. Examples include click-through data, but also more distant sources such as cognitive processing data like eye tracking or keystroke dynamics. Such fortuitous data [5] represents a potentially immense resource of side benefit in the form of noisy auxiliary data. But can such auxiliary data be used to improve natural language processing? Only very little work exists; some first promising attempts have mainly focused on gaze pattern data, e.g., [1,2]. No prior work yet explores keystroke dynamics.

Keystroke dynamics concern a user's typing pattern. When a person types, the latencies between successive keystrokes and their durations reflect the unique typing behavior of that person. Keystroke dynamics have been used extensively in psycholinguistic and writing research to gain insights into cognitive processing. Keystroke logs have a distinct advantage over other cognitive modalities like eye tracking or brain scanning: they are readily available and can be harvested easily, because they do not rely on any special equipment beyond a keyboard. Moreover, they are non-intrusive, inexpensive, and have the potential to offer continuous adaptation to specific users. Imagine integrating keystroke logging into (online) text processing tools. But do keystroke logs contain actual signal that informs natural language processing (NLP) models?

We postulate that keystroke dynamics contain information about syntactic structure that can inform shallow syntactic parsing. To test this hypothesis, we perform first experiments in which we use keystroke dynamics as auxiliary data in a multi-task learning setup [3,4]. In particular, we first refine the raw keystroke data, devise a simple approach to derive automatically-labeled data from raw keystroke logs (in particular, pre-word pauses), and integrate them as an auxiliary task in a multi-task bidirectional LSTM model. We show the effectiveness of using auxiliary keystroke data on two shallow syntactic parsing tasks, chunking and CCG supertagging. Our model is simple, has the advantage that data can come from distinct sources, and produces models that are significantly better than models trained on the text annotations alone.
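The derivation of auxiliary labels from pre-word pauses might, for instance, discretize pauses into quantile bins (a hypothetical binning for illustration; the paper's actual refinement of the raw logs may differ):

```python
import numpy as np

def pause_labels(pre_word_pauses_ms, n_bins=3):
    """Turn raw pre-word pause durations (ms) into discrete auxiliary
    labels by binning at the empirical quantiles, so each word gets a
    short/medium/long pause class usable as an auxiliary tagging task."""
    edges = np.quantile(pre_word_pauses_ms, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(pre_word_pauses_ms, edges)

# One pause per word in a typed sentence: long pauses tend to precede
# syntactic boundaries, which is the conjectured signal.
pauses = np.array([40, 55, 300, 80, 900, 120, 60, 450])
labels = pause_labels(pauses)
```

These word-level labels can then serve as the auxiliary prediction target alongside chunking or supertagging in a shared bidirectional LSTM.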

Note: this work will be presented at COLING 2016, and the full text of this submission is available at [4].


[1] Barrett, Maria; Søgaard, Anders. 2015. Using reading behavior to predict grammatical functions. EMNLP Workshop on Cognitive Aspects of Computational Language Learning. Lisbon, Portugal.

[2] Klerke, Sigrid; Goldberg, Yoav; Søgaard, Anders. 2016. Improving sentence compression by learning to predict gaze. North American Chapter of the Association for Computational Linguistics (NAACL). San Diego, CA.

[3] Barbara Plank, Anders Søgaard and Yoav Goldberg. Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In ACL, 2016. Berlin, Germany.

[4] Barbara Plank. Keystroke dynamics as signal for shallow syntactic parsing. The 26th International Conference on Computational Linguistics (COLING). Osaka, Japan.

[5] Barbara Plank. What to do about non-standard (or non-canonical) language in NLP. In KONVENS 2016. Bochum, Germany.
Setareh Ariafar, Alican Bozkurt, Kivanc Kose, Junxiang Chen, Dana Brooks, Milind Rajadhyaksha and Jennifer Dy. Network-SVM: Support Vector Machine for Network Data
Abstract: Many real-world applications produce data with relational information represented as networks, for example hyper-linked web pages and social networks. Traditional classification algorithms assume that the samples to be classified are independent and thus do not take into account any relational/network structure that might be present. In this paper, we propose a novel large-margin classifier designed for classifying network data, which we call Network-SVM (Network Support Vector Machine). Our method takes into account both the attributes describing each sample individually and the full network dependency structure that connects samples. Incorporating network information based on training data can lead to bias in learning and inference: a misclassified sample may negatively affect its neighbors' predictions. Network-SVM avoids this bias by taking into account the uncertainty of the network neighbors' predictions. Finally, Network-SVM is formulated as a quadratic objective function optimized over a convex linear set, which can be solved efficiently after a linear programming relaxation, and it extends to nonlinear decision boundaries using kernels.
Maria Herrero. Machine Learning in Pharmaceutical Science
Abstract: Recent advances in information technology have led to a rise in the amount of accessible pharmaceutical and pharmacological information. The combination of these huge amounts of data and machine learning techniques can bring new methods to pharmaceutical research. Here, I describe my current work at the Institute of Pharmaceutical Sciences exploring the application of machine learning to various research problems in the areas of pharmaceutics, pharmacovigilance, and drug repurposing.
The first refers to the prediction of phase behavior in microemulsion systems. Microemulsions offer considerable potential for use as drug delivery systems, but the identification of suitable cosurfactants for pharmaceutical formulation requires extensive experimental work. Artificial Neural Networks (ANN) have proven able to predict the phase behaviour of microemulsion-forming systems, although work to improve these predictions at the interface level is still required.
Pharmacovigilance is the science and activities relating to the detection, assessment, understanding, and prevention of adverse effects or any other drug-related problem. Drug-drug interactions (DDIs) are a type of adverse effect that can affect patients treated with more than one drug. Although some researchers have applied machine learning to the identification of unknown DDIs, the prediction of their clinical relevance had not been tackled until now. We have observed that using drug pharmacokinetic and pharmacodynamic properties and drug-protein relationships as input data for ANN models is a promising approach for the identification of potentially serious DDIs.
The last application described here is drug repurposing, the identification of new indications for licensed drugs. Specifically, we have focused on the discovery of new treatments for psychiatric conditions through the application of machine learning methods. Support vector machines (SVM) and ANN have outperformed classification trees (CT) and random forests (RF) for the prediction of potential antipsychotics based on chemical structure properties and drug-protein relationships. However, a better description of the pathophysiological basis of these diseases is required. Thus, we have represented psychosis-related proteins and drugs in a biological network and used its topological properties as input data for ANN models. In future work, we plan to investigate whether representing drug-protein relationships as network topological properties influences the performance of the models compared to other representations.
Although still in progress, I believe that these projects provide a great example of the interesting, diverse, and useful applications for machine learning in the pharmaceutical domain.
Saerom Park, Jaewook Lee, Kyoungok Kim and Huisu Jang. Semi-supervised document embedding adjusting local structure for sentiment analysis
Abstract: Sentiment analysis has been one of the most popular business applications of text analytics, because sentiments reveal people's opinions and allow meaningful or useful information to be extracted from textual sources. To analyze documents in terms of sentiment, data scientists typically classify or visualize them after obtaining numerical representations. However, how to represent documents as numerical vectors is the key problem. Texts are usually embedded in an unsupervised framework, so the embedding tends to lose much of the meaning in the texts. We therefore complement this deficiency with other information about the texts, and as a result obtain better document embeddings learned in a semi-supervised framework.
There are two approaches widely used for document representation: bag-of-words and distributed representations. The first, conventional approach is simple and intuitive, but it suffers from the curse of dimensionality and fails to preserve context or semantic information. The second approach tries to reflect the context information of documents and avoids the dimensionality problem. In this research, we used the distributed approach (also called doc2vec). Although document vectors are trained to maximize the conditional probability of context words, they lose the direct relationship with the words that make up the text. As a result, the document embedding from the original doc2vec algorithm has difficulty reflecting the sentiment information of words, and performs worse in sentiment analysis than the bag-of-words model. To overcome these issues, we built a semi-supervised document embedding model that preserves the overall structure of the embedding while adjusting the local structure with partial sentiment information.
The objective is to maximize not only the conditional probability of context words but also the conditional probabilities of neighboring documents. In other words, we manipulate the document embedding via gradient descent updates to raise the probability of a document conditioned on nearby documents with the same sentiment. For unlabeled documents, labels are first assigned probabilistically according to the label distribution of their neighbors; the conditional probabilities of neighboring documents are then added to the objective and trained in the same way. The model considers label information from neighbors only, since similarity between distant documents can be inaccurate. As a result, our semi-supervised learning algorithm learns numerical representations that unify contextual information, document similarity and partial sentiment information.
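The probabilistic labeling of unlabeled documents can be sketched with a k-nearest-neighbour rule (an assumed implementation for illustration; the exact procedure may differ):

```python
import numpy as np

def soft_labels_from_neighbours(doc_vecs, labels, k=3):
    """Give each unlabeled document (label = -1) a probabilistic sentiment
    label equal to the positive-label fraction among its k nearest labeled
    neighbours in the current embedding space."""
    labeled = np.flatnonzero(labels >= 0)
    out = labels.astype(float)
    for i in np.flatnonzero(labels < 0):
        d = np.linalg.norm(doc_vecs[labeled] - doc_vecs[i], axis=1)
        nn = labeled[np.argsort(d)[:k]]      # k nearest labeled documents
        out[i] = labels[nn].mean()           # estimated P(positive)
    return out

# Toy embedding: a positive cluster near the origin, a negative one far away,
# and one unlabeled document sitting inside the positive cluster.
vecs = np.array([[0, 0], [0.1, 0], [0.2, 0], [5, 5], [5.1, 5], [0.05, 0.05]])
labels = np.array([1, 1, 1, 0, 0, -1])
soft = soft_labels_from_neighbours(vecs, labels)
```

Once such soft labels exist, the unlabeled documents can contribute the same neighbor-conditioned probability terms to the objective as the labeled ones.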
In a real-world application, our proposed method showed that the document representations reveal the sentiment distribution effectively. It outperformed traditional distributed representation approaches in both classification and visualization. As intended, similar documents were located nearby not only in syntactic, contextual and structural terms but also in sentiment. Our primary contribution is a new semi-supervised framework for learning the embedding. The approach is also extensible, with room for development and improvement in how local and additional information are incorporated. As a result, we obtain document representations with greater expressive power.
Delia Fernandez-Canellas, Victor Campos, Xavier Giró Nieto, Brendan Jou and Shih-Fu Chang. Is a “happy dog” image more “happy” than “dog”? - Analyzing Adjective and Noun Visual Contributions
Abstract: Computers are acquiring increasing ability to detect high-level visual content such as objects in images, but often lack an affective comprehension of this content. Affective computing is useful for the behavioral sciences, with applications in brand monitoring and advertising effectiveness. The main problem in the visual task of mapping affect or emotions to images is overcoming the affective gap between low-level features and the emotional content of an image.

One rising method to capture visual affections is through the use of Adjective-Noun Pair (ANP). ANPs were introduced as a mid-level affect representation to overcome the affective gap by combining nouns, which define the object content, and adjectives, which add a strong emotional bias, yielding concepts such as “happy dog” or “misty morning”.

Current state-of-the-art methods approach ANP prediction by training visual classifiers on these pairs. In this work, we hypothesize that the visual contributions of nouns and adjectives differ across ANPs. We propose a feature-based intermediate representation for ANP prediction using specialized convolutional networks for adjectives and nouns separately. By fusing representations from nouns and adjectives, the network learns how much the nouns and adjectives contribute to each ANP, which a single-tower network does not allow.

The specialized noun and adjective networks follow an AlexNet-styled architecture. These networks are fused into an intermediate feature representation, and ANPs are then learned from it using a fully-connected network. We investigate noun and adjective contributions with two kinds of fusion. The first fusion uses the output of the softmax layer: these are class-probability features, so every dimension corresponds to an adjective or noun class. The second fusion uses the fc7 layer output: these features contain visual information, allowing interpretation of adjective and noun visual relevance. For the feature contributions of each ANP, we compute a deep Taylor decomposition [1].

For experiments, we use a subset of 1,200 ANPs from the tag-based English-MVSO [2] dataset. The ANPs are composed of combinations of 350 adjectives and 617 nouns. With settings identical to the adjective and noun networks, an ANP classification network is trained end-to-end as the baseline. Using the fc7 features, we improve over the baseline in both top-1 and top-5 accuracy. We also observe that adjectives and nouns contribute differently across ANPs; e.g., for the ANP "pregnant woman" the adjective contributes the most, while for "cute cat" the predominant contribution is in the noun. Using the probability features we find other insights, such as nouns or adjectives that co-occur; e.g., for "happy halloween" the contributions of the nouns "blood" and "cat", and of the adjectives "haunted" and "dark", are also high.

Based on the experimental results, we confirm our hypothesis that adjectives and nouns contribute differently to ANP concepts. Furthermore, our architecture outperforms traditional methods while giving insights into the roles of adjectives and nouns in the prediction.

[1] Montavon, Grégoire, et al. "Deep Taylor Decomposition of Neural Networks." ICML Workshop on Visualization for Deep Learning, 2016.

[2] Jou, Brendan, et al. "Visual affect around the world: A large-scale multilingual visual sentiment ontology." ACMM, 2015.
Maria Carmela Padula, Elisa Scariati, Marie Schaer, Maude Schneider, Dimitri Van De Ville and Stephan Eliez. Predicting psychotic symptoms from DTI-based structural connectivity in patients with 22q11DS
Abstract: 22q11.2 deletion syndrome (22q11DS) is a neurogenetic condition caused by a de novo mutation on chromosome 22q11. Patients with 22q11DS are characterised by an ultra-high risk of developing schizophrenia (Murphy et al., 1999). Thus, this syndrome represents a homogeneous model for the study of neural biomarkers related to psychosis (Bassett and Chow, 1999). Impairments in structural connectivity have widely been reported in 22q11DS and have been related to the presence of psychotic symptoms (Scariati et al., 2016). Furthermore, a resting-state fMRI study showed that functional dysconnectivity predicts psychotic diagnosis in these patients with an accuracy of 88% (Scariati et al., 2014). No studies to date have described patterns of brain dysconnectivity predicting psychotic symptoms in this group of patients.

In this study, we used a multivariate pattern analysis to discriminate between 49 patients with 22q11DS with (N=25) and without (N=24) psychotic symptoms based on structural connectivity patterns. Structural T1-weighted images were processed using the FreeSurfer software (http://freesurfer.net) and the cortical surface was subdivided into 83 regions of interest (Desikan et al., 2006). Diffusion Tensor Imaging (DTI) acquisitions were used to construct structural connectivity matrices using the MRtrix software (http://www.mrtrix.org). The orientation distribution function was computed using constrained spherical deconvolution, and deterministic tractography was used to reconstruct the white matter streamlines. Classification was conducted using a linear Support Vector Machine (SVM) classifier applied to the connectivity matrices. After selecting the most salient features using a point-biserial correlation criterion, the classifier was trained using Leave-One-Out cross validation (LOOCV). The global accuracy and the Wilson confidence interval were finally computed.
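The feature-selection step described above, ranking each connectivity feature by its point-biserial correlation with the binary diagnosis label and keeping the strongest ones, can be sketched as follows. The data, variable names and the use of random matrices are illustrative stand-ins, not the authors' code.

```python
import numpy as np

def point_biserial(feature, labels):
    """Point-biserial correlation between a continuous feature and 0/1 labels.
    Equivalent to the Pearson correlation when one variable is binary."""
    g1, g0 = feature[labels == 1], feature[labels == 0]
    n1, n0, n = len(g1), len(g0), len(feature)
    s = feature.std()  # population standard deviation
    if s == 0:
        return 0.0
    return (g1.mean() - g0.mean()) / s * np.sqrt(n1 * n0 / n**2)

def select_top_features(X, y, k):
    """Rank features by |point-biserial correlation| and keep the k strongest."""
    scores = np.array([abs(point_biserial(X[:, j], y)) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

# Toy stand-in: 49 subjects, 3403 connectivity features (83*82/2), keep top 550
rng = np.random.default_rng(0)
X = rng.normal(size=(49, 3403))
y = rng.integers(0, 2, size=49)
selected = select_top_features(X, y, 550)
```

In a LOOCV setting, the selection would be re-run inside each fold on the training subjects only, to avoid leaking the held-out label into the filter.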

The maximum accuracy (73%) was obtained using the 550 highest correlated features, corresponding to a sensitivity of 71% and a specificity of 76%. The most discriminant features included connections of the frontal lobe as well as fronto-subcortical connections, including the thalamus and the brainstem. Dysconnectivity of the parietal lobe also contributed to the classification. 43% of the connections were reduced in psychotic patients compared to non-psychotics, while 57% were increased.

Our results suggest that patterns of white matter dysconnectivity allow discrimination between patients with 22q11DS who do and do not manifest psychotic symptoms. Thus, structural connectivity alterations may serve as potential biomarkers for the development of psychosis in these patients.
Natasha Jaques, Shane Gu, Richard Turner and Douglas Eck. Generating Music by Fine Tuning Recurrent Neural Networks with Reinforcement Learning
Abstract: Generative modeling of music has been explored in the context of training Recurrent Neural Networks (RNNs) to predict the next note in a musical sequence. Similar to a character-RNN, these ‘Note-RNNs’ can successfully produce musical compositions, but tend to suffer from common failure modes, such as excessively repeating notes, or producing melodies that lack a consistent global structure. In this work we propose a novel method for using Reinforcement Learning (RL) to impose musical structure onto an RNN trained on data. The reward used in the RL framework is a combination of constraints based on music theory, as well as the output of a trained Note-RNN. We show that the objective function optimized by this model can be related to the variational objective used in Stochastic Optimal Control, and effectively combines both an RL and Maximum Likelihood (ML) cost. Using this model, we are able to significantly reduce unwanted behaviors of the RNN, and produce more pleasing melodies.

We begin by training a Note-RNN to predict the next note in a monophonic melody. To apply RL, we frame the problem of choosing the next note in a composition as acting in an environment. Deep Q-learning is used to model the estimated future return of choosing each possible note given the state of the composition. The trained Note-RNN is used as a Reward Network, which remains fixed during training. This Reward Network supplies the initial reward value for each note action, based on the probability it places on choosing that note as the next in the sequence. The reward value is further modified based on rules of music theory, which encourage the model to stay in key, avoid excessively repeating notes, resolve melodic leaps, and play musical motifs, among others.
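The combined reward described above, the fixed Reward Network's probability for the chosen note together with music-theory terms, can be sketched as below. The rule set, constants and weighting are illustrative placeholders, not the paper's exact reward function.

```python
import math

C_MAJOR = {0, 2, 4, 5, 7, 9, 11}  # pitch classes of the C major scale

def theory_reward(note, composition):
    """Toy music-theory terms: stay in key, penalise excessive repetition."""
    r = 0.0
    r += 1.0 if note % 12 in C_MAJOR else -1.0            # stay in key
    if len(composition) >= 2 and composition[-1] == composition[-2] == note:
        r -= 2.0                                           # third repeat in a row
    return r

def combined_reward(note, composition, rnn_prob, c=0.5):
    """Total reward = log p_RNN(note | composition) + c * music-theory reward."""
    return math.log(rnn_prob) + c * theory_reward(note, composition)

# An in-key, non-repeated note the Note-RNN finds likely scores higher than
# an out-of-key note that would be repeated a third time.
r_good = combined_reward(60, [64, 62], rnn_prob=0.3)   # C after E, D
r_bad  = combined_reward(61, [61, 61], rnn_prob=0.05)  # C# repeated again
```

In the paper's setup this quantity would supply the reward signal for Deep Q-learning over note actions, with the Reward Network held fixed throughout training.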

Results show that our method can dramatically improve performance with respect to the rules of the reward function. For example, before the application of RL, 63.3% of notes chosen by the Note-RNN were excessively repeated, compared to only 0.07% after training with RL. Similarly, the basic Note-RNN resolved only 77.2% of melodic leaps, and only 5.58% of notes belonged to a motif; after RL, 90.0% of leaps were resolved, and 73.8% of notes belonged to a motif. Therefore, the application of RL can successfully impose the desired musical structure onto the model. Further, the melodies produced by the model sound more pleasing, musical, and varied than either the monotonous and sometimes dissonant melodies of the Note-RNN, or the deterministic melody produced by training with only RL but no Reward Network.

Our primary contribution is demonstrating that by combining RL and ML, we can impose additional structural constraints on the behavior of a model trained on data. This not only allows us to produce pleasing and varied melodies that obey musical theory, but can correct for common failure modes of an RNN, such as excessively repeating one token. Therefore this technique can have broader future applications, e.g. in modeling sequences of text.
Jauwairia Nasir, Yong-Ho Yoo, Deok-Hwa Kim and Jong-Hwan Kim. User preference-based Integrated Multi-memory Neural Model for Improving the Cognitive Abilities of Autonomous Robots
Abstract: Automated task planning for robots faces great challenges in that the sequences of events needed for a particular task mostly have to be hard-coded. In addition, beyond learning distinct experiences correctly, significant or recurring experiences are expected to be learned better and to be retrieved relatively easily.

To empower robots with higher levels of autonomy, various cognitive architectures ([1]-[9]), neural memory models ([10]-[15]) and adaptive resonance theory networks ([16]-[20]) have been proposed for sequential learning; they differ in the mechanisms used for storing and retrieving. Many models with various architectures and designs for representing semantic knowledge have also been proposed to improve the reasoning abilities of robots ([21]-[26]). However, to the best of our knowledge, the above-mentioned spatio-temporal memory models do not account for repetition and user preference while encoding events, such that an alternate route to certain memories emerges over time. Moreover, in our memory model the integration of semantic knowledge with episodic memory is used not only for intra-category planning but also for inter-category planning, which has not yet been widely explored ([27]-[29]).

We propose a hierarchical Fusion ART [30] based approach of user preference-based integrated multi-memory model (pMM-ART), extended from [27]. Episodic memory is responsible for storing spatio-temporal sequences of events as episodes/tasks. Tasks that are considered more important by the user and are retrieved often become consolidated at a certain point in time, once their memory strength increases above a pre-defined threshold. In other words, the episodes move to a consolidated memory that holds a more stable version of them that is easier to recall; these consolidated episodes are referred to as semantic concepts.

The Object Fact Map (OFM) is a map of objects, their attributes, object categories and the semantic relationships among these categories. It stores a unique representation of each object in terms of various characteristics (color, eatable, drinkable, storage, fragile, tags) and location. It then clusters objects into object categories, based on the user's interpretation of what the clusters should look like. A Semantic Relation Matrix (SRM) defines the strength of the relationship between categories based on the concepts and attributes the categories share. This information is used to extend planning to semantic categories rather than individual objects, and by an inference module to make decisions in various scenarios based on a reasoning rule base.

Simulations and experiments have been conducted using Mybot-KSR to validate the model. An example illustrating the usefulness of pMM-ART: if a user teaches a service robot, say Mybot, to put pepsi cans in cabinet A and also teaches Mybot that cider, cola and pepsi are similar in their attributes, then Mybot would be able to infer that cider and cola should also be stored in cabinet A without ever learning this explicitly. In another scenario, if the user gives higher preference to drinking cider rather than coffee and frequently asks Mybot to prepare cider, Mybot will remember the former task better and predict its sequence of events earlier.
Angie Shen, Benjamin Goldstein and Rebecca Steorts. Dynamic Risk Prediction Models Used for Patient Deterioration
Abstract: Electronic Health Records (EHR) are health data collected in digital format, characterized by dense, serial information on patients collected at multiple time points. In recent years EHR systems have been increasingly adopted by hospitals and have been used to develop risk prediction models. Clinically, we propose to use EHR data to develop a dynamic risk prediction model for patient deterioration. Analytically, we propose to incorporate the time varying features of EHR data into a risk score.  

One commonly used metric for detecting patient deterioration from EHR data is the National Early Warning Score (NEWS). The NEWS summarizes patients’ vitals data into a single risk prediction score using fixed cut points. While designed to be easily implemented, NEWS is not tailored to any specific patient population, nor does it leverage all the data available in an EHR system. We propose a re-estimation of the NEWS, tailoring the model fit to our patient population and incorporating a broader set of available predictors. To fit the model we use a Cox proportional hazards model. The Cox model is a semi-parametric model for estimating time to an event that allows one to specify a parametric form for the covariates along with a non-parametric baseline hazard function. By formulating the data as a counting process, we are able to incorporate each patient's vitals measurements at multiple time points as covariates in the Cox regression model. We further incorporate individual dynamics into the model by including the trajectory of vitals change as extra predictors. Moreover, we assess the impact of different choices in setting up the data, such as the handling of missing data, the binning of time, the size of the training data, and the incorporation of changes in measurements over time. We test different ways of formulating the data on a validation set and use the partial log-likelihood as an evaluation metric to choose the best model. Finally, we use cumulative AUC curves and incidence AUC curves to evaluate the final models' impact on clinical decision making. Overall, we illustrate how the refitting of commonly used risk scores can lead to improved model performance.
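The counting-process formulation mentioned above can be sketched as a Cox partial log-likelihood over (start, stop] intervals: at each event time, the risk set consists of all intervals covering that time, so a patient contributes whichever vitals values held over that interval. This is an illustrative numpy implementation with toy data, not the authors' code.

```python
import numpy as np

def cox_partial_loglik(beta, start, stop, event, X):
    """Cox partial log-likelihood with time-varying covariates.
    Each row is an interval (start, stop] for one subject, with the covariate
    values that held over that interval; event=1 if the event occurred at `stop`."""
    eta = X @ beta
    ll = 0.0
    for i in np.flatnonzero(event):
        t = stop[i]
        at_risk = (start < t) & (t <= stop)  # intervals covering the event time
        ll += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return ll

# Toy data: 3 subjects; the middle one has two intervals (a vital re-measured at t=2)
start = np.array([0.0, 0.0, 2.0, 0.0])
stop  = np.array([3.0, 2.0, 5.0, 4.0])
event = np.array([1,   0,   1,   0])
X     = np.array([[0.1], [0.4], [0.9], [0.2]])
ll = cox_partial_loglik(np.array([0.5]), start, stop, event, X)
```

Fitting then amounts to maximizing this quantity over beta; established packages handle ties and the baseline hazard, which this sketch omits.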
Ciara Pike-Burke. Optimistic Planning for the Stochastic Knapsack Problem
Abstract: The stochastic knapsack problem is a classical resource allocation problem. It consists of determining which of a set of items to place into a knapsack of fixed capacity, where item weights are stochastic and placing an item in the knapsack yields a stochastic reward. It is assumed that the item size and reward distributions are unknown, with only access to a generative model. The stochastic knapsack problem has attracted a lot of interest over the years due to its wide range of applications in areas such as scheduling, investment, portfolio selection and technology. Consider, for example, an education app that suggests a one-hour homework task for students. Each exercise (item) will take a random amount of time (size) and improve the student’s knowledge in a stochastic manner (reward). To make optimal use of the available time, the app needs to track the progress of the user and adjust accordingly. We are interested in finding the optimal policy, that is, the optimal sequence of items to put into the knapsack, where the choice of an item may depend on the realised sizes of previous items.

Optimistic planning algorithms are a family of tree search algorithms that harness ideas from multi-armed bandits to find the optimal policy in a discounted Markov decision process. Starting from the basic policies consisting of playing just one item, the policies with the highest upper confidence bounds are chosen to be expanded and their child policies explored.

We propose to apply an optimistic planning strategy to the stochastic knapsack problem. In doing so, we remove the need for an artificial discount factor. Furthermore, our algorithm is adaptive to the amount of resources available. This is achieved by using Azuma-Hoeffding bounds to simultaneously build confidence bounds on the capacity and the reward. Our approach also makes efficient use of all samples gathered from the generative model, pooling samples between policies and bounds. We demonstrate that, with high probability, our algorithm returns an $\epsilon$-optimal solution to the stochastic knapsack problem. We bound the number of calls to the generative model required by the algorithm in terms of the number of $\epsilon$-critical policies, that is, the number of policies that may potentially need to be evaluated by the algorithm in order to obtain an $\epsilon$-approximation. We also compare our algorithm to a basic adaptation of StOP from [1] and to the current best performing algorithm in the operational research literature [2] in a variety of experimental scenarios.
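The optimistic-expansion step common to this family of algorithms can be sketched as maintaining, for each candidate policy, a Hoeffding-style upper confidence bound on its value from the samples drawn so far, and expanding the most optimistic policy. The bound form and constants below are generic illustrations, not the paper's exact bounds.

```python
import math

def hoeffding_ucb(sample_mean, n, delta, value_range=1.0):
    """Upper confidence bound from Hoeffding's inequality: holds with
    probability at least 1 - delta for n bounded i.i.d. samples."""
    return sample_mean + value_range * math.sqrt(math.log(1.0 / delta) / (2 * n))

def pick_policy_to_expand(policies, delta=0.01):
    """policies: dict name -> (sample_mean, n_samples); expand the most optimistic."""
    return max(policies, key=lambda p: hoeffding_ucb(*policies[p], delta))

# A lightly sampled policy keeps a wide bound, so it is explored despite a
# lower empirical mean.
policies = {"item A first": (0.55, 40), "item B first": (0.50, 10)}
chosen = pick_policy_to_expand(policies)
```

As more samples accumulate, the bounds shrink and expansion concentrates on genuinely promising policies.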

[1] Szörényi, B., Kedenburg, G., & Munos, R. (2014). Optimistic planning in Markov decision processes using a generative model. In Advances in Neural Information Processing Systems (pp. 1035-1043).
[2] Dean, B. C., Goemans, M. X., & Vondrák, J. (2008). Approximating the stochastic knapsack problem: The benefit of adaptivity. Mathematics of Operations Research, 33(4), 945-964.
Andrea Barraza. Exploitation-Exploration Aware Diversification for Recommendation Systems
Abstract: Recommendation Systems (RS) are tools that help users discover new and interesting products by means of proactive personalized suggestions. Recent research has found that diversity is a desirable but lacking feature in RS; the aim is to decrease the redundancy in results by offering users a range of options rather than a homogeneous set of alternatives. With diversity, RS can: (a) encourage product discovery by incentivizing users to explore unknown sections of the catalogue, (b) cover a wider spectrum of user preferences, and (c) respond to ambiguous user preferences with a list of varied items, thus increasing the chance that the user will like at least one item.

In related work, the diversification challenge is focused on the notion that there is a trade-off between the individual accuracy of an item and the overall diversity of the recommendation list. Thus, it has been defined as a bi-criterion optimization problem where the goal is to select k items from a broader set so as to maximize both relevance and diversity. We argue that in RS, in addition to the diversity vs. relevance trade-off, there is also a trade-off between exploitation of the user profile and exploration of novel products. To address this, we propose a diversification solution that can be tuned towards either more explorative or more exploitative recommendations.

In [1] we introduce XPLODIV, a post-filtering diversification approach that offers parameters to control both trade-offs: diversity vs. relevance and exploitation vs. exploration. This technique is founded on four core dimensions: relevance, diversity, exploitation and exploration. The approach is formulated as an optimization problem that aims to choose, from a set of candidate items generated by a conventional RS, a subset of diversified items maximizing the XPLODIV objective function. The approach has two control parameters used to tune the two trade-offs.
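A post-filtering objective of this general shape can be sketched as a greedy selection over the four dimensions. The scoring function, parameter names (`lam`, `xi`) and signals below are illustrative stand-ins under our own assumptions, not the published XPLODIV formulation.

```python
def greedy_diversify(candidates, relevance, dissim, novelty, k, lam=0.5, xi=0.5):
    """Greedily pick k items maximizing a weighted mix of:
      - relevance vs. diversity, controlled by lam;
      - exploitation vs. exploration, controlled by xi.
    relevance[i]: score from the base recommender (exploitation signal);
    novelty[i]:   distance of item i from the user's profile (exploration signal);
    dissim[i][j]: pairwise dissimilarity between items (diversity signal)."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(i):
            div = min((dissim[i][j] for j in selected), default=1.0)
            accuracy = xi * relevance[i] + (1 - xi) * novelty[i]
            return lam * accuracy + (1 - lam) * div
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

# Hypothetical items: "b" balances relevance and novelty, "c" is very novel.
relevance = {"a": 0.9, "b": 0.85, "c": 0.3}
novelty   = {"a": 0.1, "b": 0.7, "c": 0.9}
dissim    = {i: {j: (0.0 if i == j else 1.0) for j in "abc"} for i in "abc"}
picks = greedy_diversify(["a", "b", "c"], relevance, dissim, novelty, k=2)
```

Setting `lam=1, xi=1` recovers a plain relevance ranking, which makes the role of each trade-off parameter easy to inspect.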

Through experimental validation, we show that XPLODIV: (a) can be tuned towards more exploitative diverse results or more explorative diverse results, with controlled sacrifice over relevance and (b) generates comparable results to baselines and state-of-the-art approaches. For experimental validation, both qualitative and quantitative tests were carried out over the MovieLens 100k dataset.

As a step towards facing the diversification problem in RS, we present XPLODIV. Our approach considers not only the trade-off between diversity and relevance, but also the trade-off between exploitation and exploration of products. As an advantage, our technique allows control over how much the RS exploits the known information about the user by covering the preferences in the user profile, and how much it explores other preferences the user might have by offering novel products. Our diversification technique presents an improvement over current diversification approaches, as it explicitly considers the aspect of exploration which, given its direct influence on factors related to novelty, serendipity and discovery, is essential to the success of any RS.

[1] Barraza-Urbina, A., Heitmann, B., Hayes, C., Carrillo, A. XPLODIV: An Exploitation-Exploration Aware Diversification Approach for Recommender Systems. In Proc. of the 28th FLAIRS Conference 2015.
Viktoriya Krakovna and Finale Doshi-Velez. Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models
Abstract: Following the recent progress in deep learning, researchers and practitioners are recognizing the importance of understanding and interpreting what goes on inside these black box models. Recurrent neural networks (RNNs) have recently revolutionized speech recognition and translation, and could be very useful in other applications involving sequential data. However, adoption has been slow in applications such as health care, where practitioners are reluctant to let an opaque expert system make crucial decisions. If we can make the inner workings of RNNs more interpretable, more applications can benefit from their power.

It is common for neural networks to show human-level performance most of the time, but also perform very poorly on seemingly easy cases. For instance, an imperceptible adversarial perturbation to an image can make a convolutional network misclassify it with very high confidence (Nguyen et al, 2015). We might expect recurrent networks to fail in similar ways. It would thus be useful to have more visibility into where these sorts of errors come from, e.g. which groups of features contribute to such flawed predictions.

Several promising approaches to interpreting long short-term memory (LSTM) RNNs have been developed recently. Che et al (2015) use gradient boosting trees to predict LSTM output probabilities and explain which features influenced the prediction, treating the LSTM architecture as a black box. Karpathy et al (2015) showed that in LSTM language models, around 10% of hidden state dimensions can be interpreted with the naked eye by coloring the text data with the state values; some of them track quotes, brackets and other identifiable aspects of the text. Building on these results, we take a more systematic approach to looking for interpretable hidden states, by using decision trees to predict individual hidden state dimensions. We visualize the overall dynamics of the hidden states by coloring the training data according to k-means clusters of the state vectors.
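The visualization step just described, clustering the per-timestep hidden state vectors with k-means so each character can be colored by its cluster, can be sketched as follows. The tiny Lloyd's-iteration implementation and random stand-in states are illustrative; any k-means library would do the same job.

```python
import numpy as np

def kmeans(states, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: states is (T, H), one hidden state per timestep."""
    rng = np.random.default_rng(seed)
    centers = states[rng.choice(len(states), size=k, replace=False)]
    for _ in range(iters):
        # assign each state vector to its nearest center
        d = ((states[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # recompute centers (keep the old center if a cluster empties)
        for c in range(k):
            if (labels == c).any():
                centers[c] = states[labels == c].mean(0)
    return labels

# Stand-in for LSTM hidden states over 200 characters of text; each character
# would then be colored by its cluster label.
rng = np.random.default_rng(1)
hidden = rng.normal(size=(200, 64))
labels = kmeans(hidden, k=5)
```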

We explore several approaches to building interpretable models by combining RNNs and HMMs. The existing body of literature mostly focuses on methods that specifically train the RNN to predict HMM states or posteriors, referred to as hybrid or tandem methods respectively (Bourlard & Morgan, 1994). To model the big picture of the state changes in the LSTM, we extract the hidden states and approximate them with a continuous emission hidden Markov model (HMM). To build a model with fewer LSTM states, we take the reverse approach, where an HMM is trained first, and the HMM state probabilities are added to the output layer of the LSTM. We compare this to jointly training the HMM and LSTM. In both the sequential and joint architectures, the LSTM model can make use of the information from the HMM, and fill in the gaps when the HMM is not performing well, resulting in an LSTM with a somewhat smaller number of states. We run experiments on a diverse set of text datasets (Penn Tree Bank, Shakespeare, and Linux Kernel), and find that the LSTM and HMM learn complementary information about the features in the text.
Xiaoyu Lu, Hyunjik Kim, Seth Flaxman and Yee Whye Teh. Tucker Gaussian Process for Regression and Collaborative Filtering
Abstract: Gaussian processes (GPs) [1] are flexible Bayesian non-parametric tools capable of learning intricate structure through interpretable covariance kernels. However, the computational and storage costs of traditional GP inference are O(N^3) and O(N^2) respectively, where N is the number of data points. Various attempts have been made at scalability, including inducing-point methods [2]. However, as the dimensionality of the data grows, these methods need a prohibitively large number of inducing points, and they cannot easily be generalised to non-Gaussian likelihood models. Structure-exploiting approaches, including Kronecker and Toeplitz methods, have been proposed for high-dimensional data [3]; however, these methods require at least partially grid-structured data.

Our approach is based on the “weight-space view”, which forms the kernel matrix K explicitly. This idea is related to random Fourier features [4], which explicitly map the input data to a randomised low-dimensional feature space such that the inner products of the transformed data are approximately equal to those in the feature space of a shift-invariant kernel. We extend this idea to GP regression. Assuming the kernel is a product of kernels over the input dimensions, we find a feature in each dimension such that a draw from the GP is a product between a weight tensor and a set of feature vectors from the different dimensions. If the dimensionality is high, the weight tensor may be too large, so we use a low-rank Tucker decomposition to reduce the number of parameters to be learnt. Our model is particularly well suited to grid-structured data and problems where the dependence on covariates is close to separable, and it also regularises the GP towards simpler regression functions. We show experimentally that our model gives better predictive performance than a standard GP.
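The random-feature construction referenced above can be sketched for the RBF kernel: draw random frequencies, map each input to cosines, and inner products of the mapped points approximate the kernel. This illustrates the general idea of [4], not the paper's Tucker-structured features.

```python
import numpy as np

def rff_map(X, D, lengthscale=1.0, seed=0):
    """Random Fourier features z(x) with E[z(x)^T z(y)] = exp(-||x-y||^2 / (2 l^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, D))  # spectral frequencies
    b = rng.uniform(0.0, 2 * np.pi, size=D)               # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = rff_map(X, D=5000)
K_approx = Z @ Z.T                                        # approximate kernel matrix
K_exact = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)
```

GP regression can then be carried out in this explicit feature space at a cost linear in N, which is the "weight-space view" the abstract starts from.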

Our method can also be applied elegantly to collaborative filtering, and has a natural way to incorporate side information. The most popular method for collaborative filtering is matrix factorization [5], where the rating matrix R, with the (i, j)-th element representing the rating of movie j given by user i, is modelled as a low-rank factorization R = U^T V, where U is the user matrix and V the movie matrix. We instead model the rating R_ij as a random function drawn from a GP prior. We show that the matrix factorization method is a special case of our model with a specific kernel, which can be chosen to capture additional side information neatly within our framework. Experimentally, our model outperforms Bayesian matrix factorization on the MovieLens 100k dataset.

[1] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.
[2] M. Seeger, C. Williams, and N. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In AISTATS, 2003.
[3] Y. Saatçi. Scalable Inference for Structured Gaussian Process Models. PhD thesis, 2011.
[4] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[5] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS, 2008.
Marina Riabiz and Simon Godsill. Inference for CAR(p) α-Stable Stochastic Processes via Poisson Series Approaches
Abstract: Time series arising in the natural sciences, engineering and finance are frequently composed of irregularly spaced and high-frequency data that are well represented by continuous-time state-space models. One of the simplest forms of these, in terms of analytical tractability, is the linear continuous-time autoregressive (CAR) process driven by Brownian motion. However, its simplicity limits the kinds of phenomena it can account for, and a wide literature has therefore been developed to deal with non-Gaussian models.
In particular, our research targets time series with jumps in the trajectories, such as those exhibited by stock prices or insurance losses at the occurrence of extreme events, but examples can be found in many fields, such as communications and acoustics. A commonly used framework to deal with these sudden changes is the jump-diffusion model, that adds to the linear Gaussian diffusion a finite activity jump process.
On the contrary, in our work we keep the linear CAR model [1], showing that a non-Gaussian driving noise can account for behaviours similar to those described by jump-diffusions, while allowing for a potentially infinite number of jumps in any finite interval of time. We work with the class of α-stable Lévy processes because their parameters make it easy to handle extreme values (as well as skewness) and they include Brownian motion as a special case. The main drawback of this distribution is the lack of a closed-form expression for the likelihood, which complicates the inference of the transition densities and of the parameters of the CAR state-space model. A conditionally Gaussian framework for the α-stable distribution has recently been formulated by Lemke and Godsill [2] and Lemke [3]. It is based on truncations of the Poisson series representation [4] of stable random variables and on computing the parameters of Gaussian approximations of the residuals. The advantage is that it allows inference using Bayesian computational methods such as Rao-Blackwellized particle filters and Markov chain Monte Carlo (MCMC).
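The Poisson series (LePage-type) representation on which this framework rests can be sketched, for a symmetric α-stable random variable and up to a normalising constant $C_\alpha$ (the exact constant and the asymmetric case follow [4]), as

$$X \overset{d}{=} C_\alpha \sum_{i=1}^{\infty} \Gamma_i^{-1/\alpha} W_i,$$

where the $\Gamma_i$ are the arrival times of a unit-rate Poisson process and the $W_i$ are i.i.d. symmetric random weights. The conditionally Gaussian scheme of [2], [3] truncates this series at a finite number of terms and approximates the residual tail by a Gaussian whose moments can be computed, which is what makes Rao-Blackwellized inference tractable.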
If the CAR model is multidimensional, it can represent both the value of the signal and its derivatives (fundamental, e.g., in trend-following financial strategies).
Our contribution is the extension of the conditionally Gaussian framework to the multidimensional case. We generalise the expressions for the parameters from the CAR(1) to the CAR(p) model, accounting for the p-1 derivatives of the signal. Furthermore, we provide simulation results demonstrating the effectiveness of the conditionally Gaussian multidimensional framework, and give comparisons with multidimensional jump-diffusion models.

[1] P.J. Brockwell. Lévy-driven CARMA processes. Annals of the Institute of Statistical Mathematics, 53(1): 113–124, 2001.

[2] T. Lemke and S.J. Godsill. Enhanced Poisson sum representation for α-stable processes. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 4100–4103, May 2011.

[3] T. Lemke. Poisson series approaches to Bayesian Monte Carlo inference for skewed α-stable distributions and stochastic processes. PhD thesis, 2014.

[4] G. Samoradnitsky and S. Taqqu. Stable non-Gaussian random processes: stochastic models with infinite variance. Stochastic Modeling Series. Taylor & Francis, 1994.
Roberta de Vito, Giovanni Parmigiani and Barbara Engelhardt. Bayesian Multi-study Factor Analysis in High-dimensional Biological Data
Abstract: Gaining knowledge from high-dimensional studies is a cumulative process that requires the integration of multiple, somewhat diverse studies, and relies critically on the methods used to perform the data analysis. Most measurements from high-throughput experiments display both artifactual and biological sources of variation (Irizarry et al., 2003). Models are used to capture and control confounding factors such as population structure and batch effects (Zhao et al., 2014). However, while some artifacts reappear across studies, common biological features are more likely to be shared among the studies (Garrett-Mayer et al., 2008).

In this work we develop both theoretical and computational tools for dimension reduction that allow the joint analysis of multiple high-throughput experiments, simultaneously achieving two goals: a) capturing the common component(s) across studies, and b) isolating the sources of variation that are unique to each study. To meet this challenge we introduce “Bayesian multi-study factor analysis” (BMFA), a generalized version of Bayesian factor analysis able to handle multiple studies simultaneously. We implement it with sparse modeling of the high-dimensional factor loading matrices, both common and study-specific, using spike-and-slab priors. We propose a computationally efficient algorithm, based on a traditional Gibbs sampler, to produce Bayes estimates of the parameters and to select the number of relevant common factors.

We assess the operating characteristics of our method by means of simulation studies, and we present an application to the prediction of the biological signal from four gene expression studies on ovarian cancer.
Clara Higuera Cabañes, Katheleen J. Gardiner and Krzysztof J. Cios. Self Organizing Maps based approach for the identification of protein patterns related to learning and memory in control and mouse models of Down syndrome
Abstract: Down syndrome (DS) is the most common genetic cause of learning/memory deficits. It is due to an extra copy of the long arm of human chromosome 21 and the consequent increased expression of the genes it encodes. No pharmacotherapies for learning deficits in DS are available. Because of its high incidence, there is considerable interest in preclinical evaluation of potential drugs in mouse models. In this work, protein expression data from the brains of control/healthy mice and a mouse model of DS (Ts65Dn) exposed to context fear conditioning (CFC), an experiment used to measure associative learning, were analyzed. Protein expression levels were measured in mice with and without treatment with the drug memantine. Control mice successfully learn the CFC task, while DS mice fail to learn unless they are treated with the drug. In total there were eight classes of mice with different responses to learning and memory. The standard statistical analyses employed in the Ts65Dn CFC studies do not, however, identify several important features: for example, which of the changes seen in control mice are required for successful learning, which of the abnormalities in the Ts65Dn mice directly contribute to failed learning, or which changes induced by memantine are critical for rescuing successful learning in DS mice. In this work we designed a data mining approach based on the unsupervised clustering method Self Organizing Maps (SOM) and the Wilcoxon rank-sum test to analyze this type of data from a novel perspective. There were two main goals: 1) to determine whether, using the expression levels of 77 proteins of the different mice as descriptive features, SOM would automatically cluster mice into clusters corresponding to their respective classes, and 2) to identify subsets of proteins that best discriminate between the different classes, and to define genetic and treatment-specific candidate protein responses.
This approach identified protein responses that discriminate successful learning in control mice from failed learning in the Ts65Dn mice, and that critically respond to drug treatment. Because the output of a SOM is a visual map in which the topology of the data is preserved, the approach also provides a visualization tool for analyzing the structure of the data, helpful for biologists to build or corroborate new hypotheses. The results suggest that this approach, applied to additional datasets, can help to identify protein abnormalities in DS mice, as well as the proteins that need to be altered by drug treatments to facilitate the rescue of learning deficits.
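The SOM training described above, mapping each mouse's 77-protein profile to a best-matching unit on a topology-preserving grid, can be sketched with a minimal online implementation. The grid size, schedules and random stand-in data are illustrative assumptions, not the study's configuration.

```python
import numpy as np

def train_som(data, grid=(7, 7), epochs=30, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal online SOM: data is (n_mice, n_proteins); returns the weight grid
    and the best-matching unit (BMU) of each sample."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.normal(size=(rows * cols, data.shape[1]))
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5  # decaying schedules
        for x in data[rng.permutation(len(data))]:
            bmu = ((W - x) ** 2).sum(1).argmin()
            # neighbourhood-weighted update pulls nearby units toward x
            h = np.exp(-((coords - coords[bmu]) ** 2).sum(1) / (2 * sigma**2))
            W += lr * h[:, None] * (x - W)
    bmus = ((W[None] - data[:, None]) ** 2).sum(-1).argmin(1)
    return W, bmus

# Toy stand-in: 72 mice x 77 protein expression levels
rng = np.random.default_rng(1)
mice = rng.normal(size=(72, 77))
W, bmus = train_som(mice)
```

Inspecting which classes of mice share a BMU (or neighbouring units) gives the cluster-vs-class comparison that goal 1) of the abstract describes.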

Full text with references is available here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0129126
Beatriz Remeseiro. On the effectiveness of feature selection in human tear film classification
Abstract: Dry eye syndrome is a chronic, multifactorial disease of the tears and the ocular surface, which has a negative impact on daily tasks, such as driving or working with computers. It has been recognized as a public health problem, which deserves increased attention and resources due to its growing prevalence in the last few years [1]. One of the most commonly used tests to diagnose dry eye is the assessment of the interference patterns observed in the tear film lipid layer. This clinical test consists of classifying these patterns into one of five standard categories, and it has been automated by means of image analysis techniques and machine learning algorithms.

In this context, feature selection plays an important role in different ways. Firstly, feature selection has been used as a preprocessing step to tackle image analysis, not only to reduce the dimensionality of the problem but also to alleviate the computational burden required to extract the properties from the images [2]. Three popular feature selection filters have been considered for this purpose: CFS, the consistency-based filter, and INTERACT. Their effectiveness has been demonstrated, since they allow the automatic test to run in real time (under half a second), with no degradation in performance (accuracy over 97% and robustness to noise over 92%).

The time needed to compute the image properties used for tear film classification is not homogeneous, i.e. some features are more computationally expensive than others. For this reason, a framework for cost-based feature selection has also been considered [3]. This framework adds a new term to the evaluation function which updates the weights of the features in order to reach a trade-off between the relevance of a feature and its associated cost, which here means processing time. The filters chosen in this case are CFS, which is a subset filter, and mRMR and ReliefF, which are ranker filters. The best result has been obtained with the cost-based version of ReliefF, which reduced the computational time to only 7 milliseconds, with an accuracy over 92%.
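The cost-penalised evaluation function can be sketched as follows. This is a simplified illustration, not the framework of [3]: relevance is approximated by absolute correlation with the class label, and the feature costs and trade-off parameter are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two equally relevant features (0 and 1) and one irrelevant one (2);
# feature 1 is made expensive to extract (costs in seconds, invented).
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
cost = np.array([0.01, 0.40, 0.01])
lam = 1.0  # relevance/cost trade-off parameter

# Relevance proxy: |correlation| with the label; penalise by lam * cost.
relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(3)])
score = relevance - lam * cost
ranking = np.argsort(score)[::-1]
print(ranking)
```

With equal relevance, the cheap feature 0 outranks the expensive feature 1, which is the intended trade-off behaviour.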

Finally, the influence of eye color on tear film classification has been analyzed [4]. Although interference patterns do not depend on eye color, experts find it more difficult to categorize them in light eyes and have even suggested the use of two different grading scales, one for dark eyes and the other for light eyes. Therefore, a methodology based on feature selection has been proposed to determine which features are the most relevant ones for each type of eye, thus improving the performance of the automatic test. The feature selection methods applied here include subset filters (CFS and the consistency-based filter) and ranker filters (MIM, mRMR, ReliefF, IG). In this manner, it is possible to automatically classify tear film images with maximum accuracy over 96%, regardless of eye color.
Yousra Bekhti, Daniel Strohmeier, Mainak Jas, Roland Badeau and Alexandre Gramfort. M/EEG source localization with multi-scale time-frequency dictionaries
Abstract: Magnetoencephalography (MEG) and electroencephalography (EEG) are non-invasive techniques for investigating human brain activity. They allow the measurement of ongoing brain activation on a millisecond-by-millisecond basis, which makes them very attractive for studying brain dynamics. Since the neuronal activity is measured at the sensor level (distributed over the head), the main question is how a brain region can be localized as the one producing the measured activity with reasonable accuracy. This is the so-called bio-electromagnetic inverse problem, which is ill-posed, meaning there is not a unique solution to the problem. Since the solution to the inverse problem is not unique, constraints need to be set in order to identify an appropriate solution among the multiple possible candidates. The constraints are chosen depending on the assumptions or a priori knowledge based on the characteristics of the source distributions. Common priors are based on the Frobenius norm and lead to a family of methods generally referred to as minimum norm estimates. While these methods have some benefits like simple implementation and robustness to noise, they do not take into account the natural assumption that only a few brain regions are typically active during a specific cognitive task. Also, those methods assume stationary sources, i.e. if the source estimate is active for one time instant, it would be active for the whole time window [1]. The method used here promotes spatial and temporal sparsity and recovers the non-stationary sources and the temporal smoothness [2]. This solver uses the Sparse Group Lasso prior (convex $\ell_{2,1} + \ell_1$ or non-convex $\ell_{2,0.5} + \ell_{0.5}$) in the Time Frequency (TF) domain, where it is better defined. This requires an appropriate dictionary to decompose the brain signals in the TF domain.
We do not address the problem of learning the atoms of the dictionary, as doing so makes the cost function non-convex, which would slow down convergence and make the solver dependent on the initialization. However, because the data contain a mixture of signals, i.e. short transients (right after the stimulus onset) and slower brain waves, it is difficult to choose a single dictionary that explains both signal types sparsely. We introduced a novel method to improve the source estimation relying on a multi-scale dictionary, i.e. multiple dictionaries with different scales concatenated to fit short transients and slow waves at the same time [3].
From an optimization point of view, we used a Block Coordinate Descent solver with an active-set strategy to speed up the computation. Moreover, since we used tight Gabor frames for the dictionary, we were able to check whether a source would be active and thus compute the TF decomposition only for the relevant sources.
Our novel method was shown to outperform the basic solver in the time-frequency domain in terms of reduced leakage, temporal smoothness and in detection of both signal types [3].
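For the convex $\ell_{2,1} + \ell_1$ prior mentioned above, a well-known result is that the proximal operator is entrywise soft-thresholding followed by group (row-wise) soft-thresholding, which is the building block of coordinate-descent solvers of this kind. A small self-contained sketch (the matrix and regularization values are invented):

```python
import numpy as np

def prox_sparse_group_lasso(Y, lam_l1, lam_l21):
    """Prox of lam_l21 * ||Y||_{2,1} + lam_l1 * ||Y||_1 (rows = groups).

    Entrywise soft-thresholding, then block soft-thresholding per row.
    """
    S = np.sign(Y) * np.maximum(np.abs(Y) - lam_l1, 0.0)       # l1 step
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam_l21 / np.maximum(norms, 1e-12), 0.0)
    return S * scale                                           # l21 step

# Toy example: one strong source (row) and one weak one that is zeroed out.
Z = np.array([[3.0, -2.0, 0.1],
              [0.2,  0.1, 0.0]])
out = prox_sparse_group_lasso(Z, lam_l1=0.5, lam_l21=1.0)
print(out)
```

The weak second row falls entirely below the thresholds and is set to zero, illustrating how the prior removes inactive sources.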

[1] Strohmeier et al. IEEE TMI, 2016. [2] Gramfort et al. NeuroImage, vol 70, pp. 410-422, 2013. [3] Bekhti et al. PRNI, 2016.
Omotola Dawodu. Data Mining Techniques for Predicting Vehicle Performance
Abstract: In recent times, the importance of data mining techniques to real life situations cannot be overemphasized. Motor vehicle owners are often concerned with how their vehicles will perform as time goes by. This research work examines how motor vehicle owners rate the performance of their vehicles based on different attributes. It considers the usefulness of data mining classification techniques in identifying owners of motor vehicles who will categorize their vehicles as performing well or not performing well, and also looks at the attributes responsible for their decisions. By applying data mining techniques, potential buyers of such vehicles and those who already own one can predict whether their vehicles will perform well or not.
A questionnaire was designed, and information on the following was obtained from sixty respondents who are vehicle owners: age group, gender, highest educational qualification, income, vehicle engine type, average weekly use, owner’s years of usage, average number of people driving the vehicle, number of times the vehicle engine has been repaired or replaced, and vehicle performance. This study considers two data mining techniques in WEKA: a Bayes classifier and a decision tree. The test option used for both is ten-fold cross-validation.
The Bayes classifier used here is the Averaged One-Dependence Estimator (AODE), which classifies fifty-two out of sixty instances correctly and eight incorrectly. The decision tree classifies fifty-four out of sixty instances correctly and six incorrectly. Based on this and other factors, such as the true positive rate, false positive rate, receiver operating characteristic curve, and kappa statistic, the decision tree is believed to have performed better than the Bayes classifier.
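The evaluation protocol above can be sketched in scikit-learn; note this is a stand-in, not the original WEKA experiment: AODE has no direct scikit-learn equivalent (GaussianNB is used as the Bayes baseline), and the data here are synthetic with the same sample and feature counts:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the 60-respondent, 9-attribute survey data.
X, y = make_classification(n_samples=60, n_features=9, random_state=0)

for name, clf in [("bayes", GaussianNB()),
                  ("tree", DecisionTreeClassifier(random_state=0))]:
    acc = cross_val_score(clf, X, y, cv=10).mean()  # ten-fold cross-validation
    print(name, round(acc, 3))
```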
A summary of the results obtained from the decision tree is presented. Motor vehicles of respondents who have never repaired or replaced their engines perform well. For respondents who have repaired or replaced their engines only once, the outcome depends on gender and usage: vehicles of male respondents who drive at most two days a week will not perform well, while vehicles of male respondents who drive three to four days a week, or five days and above, will perform well; vehicles of female respondents in this group will not perform well. Finally, vehicles of respondents who have repaired or replaced their engines twice will not perform well.
Lamiae Azizi. High dimensional spatio-temporal modelling for Brain cell types.
Abstract: Complex tissues, such as the brain, are composed of multiple different cell types, each of which has distinct and important roles, for example in neural function. Moreover, it has recently been appreciated that the cells that make up these sub-cell types themselves harbour significant cell-to-cell heterogeneity, in particular at the level of gene expression. The ability to study this heterogeneity has been revolutionised by advances in experimental technology, such as Wholemount in Situ Hybridization (WiSH) and single-cell RNA-sequencing. Consequently, it is now possible to study gene expression levels in thousands of cells from the same tissue type. After generating such data, one of the key goals is to cluster the cells into groups that correspond to both known and putatively novel cell types. Whilst many clustering algorithms exist, they are typically unable to incorporate information about the spatial dependence between cells within the tissue under study. When such information exists it provides important insights that should be directly included in the clustering scheme. We have developed a clustering method that uses a Hidden Markov Random Field (HMRF) model to exploit both quantitative measures of expression and spatial information. To accurately reflect the underlying biology, we extend current HMRF approaches by allowing the degree of spatial coherency to differ between clusters.
We also propose an extension to the case where the dynamics of these tissues over time are taken into account. Unlike most analyses of dynamic brain data, which focus on time-curve modelling due to the high-dimensional nature of this problem, we propose a full spatio-temporal modelling approach. Our modelling technique has an embedded regularisation technique, which improves the clustering result. The estimation of the parameters is done using variational inference, which makes our approach computationally efficient.
We demonstrate the utility of our method using simulated data and real data sets.
Mihaela Porumb. Automatic extraction of relations between medical concepts in clinical documents Text Mining
Abstract: Knowledge extraction from unstructured medical records is an important task in the development of medical decision support systems. The automatic identification of semantic relations between medical concepts can improve the health care services by discovering (possibly unknown) interactions between medical conditions, treatments and investigations. This strategy might help in developing better treatments for certain medical cases, in reducing the duration and cost of diagnosing a patient, in assessing the therapy response in clinical trials, as well as in providing prompt summaries of the patient’s medical history.

While extensive research has been done on the relation extraction task, little work exists on clinical data, mainly due to the scarcity of the resources. The data set used in our research is the data released in the fourth i2b2/VA, 2010 Workshop on Natural Language Processing Challenge. The data set consists of annotated discharge summaries and progress notes; the challenge aimed at recognizing relations between problems, treatments and tests.

In this work, we present a learning-based approach for the automatic discovery of relations, answering the same question posed in the i2b2/VA challenge. Our approach is based on the analysis of the medical documents, in vector representation, by applying Natural Language Processing (NLP) techniques - a methodology inspired by already existing successful systems [1]. We proposed an extended set of features, grouped into four categories: context, lexical, syntactic and grammatical. The problem of relation identification was cast as a multi-class classification problem and was tackled using a Support Vector Machine (SVM) classifier. The most appropriate setup for the multi-class classifier was determined experimentally as a combination of the four categories of features.

The primary contribution of this work was the development of a grammatical feature based on the dependency graph representation of the sentence. The idea behind this feature was that the dependencies contain rich semantic information that can indicate possible relations between the considered concepts. We extracted a similarity feature based on the shortest path between the relation arguments in the dependency graph, by comparing the dependency paths corresponding to different instances. Removing this feature causes the classification performance to drop significantly, from 74.9% F-measure to 64.6% F-measure. Thus we demonstrated that the shortest path between the candidate concepts in the dependency graph contains important information for the relation identification task. The categories of features with the greatest discriminatory value were found to be the lexical and context features. Our model, with an F-measure of 74.9%, outperforms the winning solution presented in the i2b2 challenge by 1.2%.
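The shortest dependency path between two concepts can be computed with a plain breadth-first search over the (undirected) dependency edges. A minimal sketch with an invented toy sentence; a real system would obtain the edges from a dependency parser:

```python
from collections import deque

def shortest_dep_path(edges, src, dst):
    """BFS shortest path over an undirected dependency graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    prev, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:                 # reconstruct path back to src
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None

# Toy dependency edges for "aspirin relieves the headache" (invented):
edges = [("relieves", "aspirin"), ("relieves", "headache"), ("headache", "the")]
print(shortest_dep_path(edges, "aspirin", "headache"))
```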

This work was presented at IIWAS Conference 2015 as a full research paper. It was also published in 2016 in the IJWIS journal [2][3]. No additional analysis has been performed since publication.

[1] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3168320
[2] Mihaela Porumb, Ioana Barbantan, Camelia Lemnaru, Rodica Potolea. REMed: Automatic relation extraction from medical documents. iiWAS, 2015.
[3] Ioana Barbantan, Mihaela Porumb, Camelia Lemnaru, Rodica Potolea. Feature Engineered Relation Extraction - Medical Documents Setting. IJWIS, 2016.
Alyssa Frazee. Fighting Credit Card Fraud with Machine Learning
Abstract: Stripe powers online businesses around the world by processing billions of dollars a year in payments. However, being in the payments business also means we need to be in the fraud detection business. We use machine learning models to automatically allow or decline charges in real time based on their predicted fraud probability. While building this system, we encountered two problems, counterfactual model evaluation and deterrence, with the shape of the multi-armed bandit problem, in which an agent must choose between "exploring" multiple options and "exploiting" the option it currently believes yields the highest payoff. In the counterfactual model evaluation setting, our challenge is to measure the performance of a model without always observing the true outcome, and in the deterrence setting, we provide misinformation to a fraudster, which disrupts the bandit problem the fraudster is solving. I'll give a brief overview of multi-armed bandits and their variants, including contextual and adversarial bandits. I'll then illustrate how counterfactual model evaluation and deterrence can be posed and "solved" in this framework.
Golnoosh Farnadi. User-Generated Content Data Fusion Framework Using Hinge-Loss Markov Random Fields
Abstract: Nowadays, web users actively generate content in various social media platforms. Modeling social media users by inferring their characteristics such as their age and gender plays an important role in providing personalized services, viral marketing, recommender systems and tailored advertisements.

Various computational approaches for user profiling based on user-generated content (UGC) have been proposed in recent years. Many of these efforts aim at finding novel techniques to infer user characteristics using only one source of information, such as users’ posts or friendship links. However, in many social media platforms, users can generate content in different modalities such as textual content (e.g., status updates, blog posts, tweets, comments, etc.) and visual content (e.g., photo and video) while connecting with each other (i.e., creating relational content). When aggregated in an appropriate way, these separate nuggets of information can lead to more accurate and more elaborate inferred user profiles. A data fusion model that leverages all available knowledge can learn more accurate user profiles and is especially useful for platforms where not every user generates the same type of information. Examples include users who write status updates but never upload pictures, or users who join social media platforms only to consume knowledge and to relate with each other, rather than producing any textual or visual content themselves. Although much progress has been made in the area of user profiling, the fusion of multi-modal information remains a largely unexplored area of research.

The data fusion model that we propose is based on a statistical relational learning (SRL) model using Hinge-loss Markov Random Fields (HL-MRFs) [1], a class of probabilistic graphical models which are highly scalable and suitable for modeling relational data such as social networks. Our contributions include (1) a general and flexible information fusion model to infer users' characteristics from textual, visual, and relational social media content; (2) two interchangeable probabilistic graphical sub-models for user profiling based on user-item relations; (3) extensive experimental validation of the proposed models for predicting age, gender and personality traits of Facebook users based on their status updates, profile pictures and page likes.
Our experimental results show that our proposed HL-MRFs fusion model efficiently combines various sources of UGC to learn more accurate user profiles. To investigate whether the accuracy gain is due to the use of HL-MRFs or due to leveraging information from multiple sources simultaneously, we have trained a series of models, including single-source logistic regression (LR) models, single-source HL-MRFs models, majority voting ensembles of the single-source models, and LR multi-source models. Our proposed HL-MRFs fusion model provides significantly better predictions of users’ characteristics and outperforms the competing methods. In addition, our proposed framework is flexible and lends itself easily to incorporating other sources of information beyond the ones we consider in this paper.

Note: this work has not been published before.

[1] S. H. Bach, M. Broecheler, B. Huang, and L. Getoor. Hinge-Loss Markov Random Fields and Probabilistic Soft Logic. arXiv:1505.04406 [cs.LG], 2015.
Brenda Betancourt, Daniele Durante and Rebecca Steorts. Random Shades of Colors: Multilayer clustering and community detection in networks
Abstract: Increasingly complex network datasets have motivated an explosion of interest in novel statistical methodologies for network analysis. This is the case of multilayer networks, where multiple types of relationships are measured among a fixed set of nodes. For example, relationships among a set of users on different social networking platforms (Facebook, Twitter, LinkedIn) or interactions among countries with respect to different types of relationships such as economy, finance, military, international security, among others.

We are particularly interested in identifying communities of nodes characterized by similar connectivity patterns while borrowing information across layers. Existing procedures to learn communities in multilayer networks are not flexible enough in accomplishing this task. With this motivation in mind, we develop a novel model-based procedure which borrows information across layers in detecting communities of nodes, while allowing the community membership of each node to possibly vary across different layers. This is accomplished by a hierarchical clustering procedure which first groups the layers according to their community structure and then detects the communities in each cluster of layers via stochastic block modeling procedures. This mechanism can be described as a simple procedure based on random shades of colors (RASHAD), which first assigns to each layer a specific color so that layers with the same color share the same community patterns. The second stage aggregates nodes to detect the community structure within each cluster, so that nodes with the same shade of color belong to the same community, shared by the layers associated with that color.

The choice of a flexible prior for random partitions plays a relevant role in developing an efficient algorithm for posterior computation in this problem setting. By considering Chinese Restaurant Process (CRP) priors on the cluster assignments at the layer level and communities at the node level (colors and shades, respectively), our model-based procedure flexibly identifies communities of nodes in multilayer networks without forcing a priori any type of block structure. The CRP priors allow us to automatically learn the total number of clusters of layers and communities of nodes in each cluster overcoming a big limitation of common clustering procedures. We illustrate the performance of our model on a real data application of relationships between households in a village of the state of Karnataka, India.
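The generative side of a CRP prior is easy to illustrate: each new item joins an existing cluster with probability proportional to its size, or opens a new cluster with probability proportional to a concentration parameter. A minimal sketch (the item count and concentration value are invented, and this simulates the prior only, not the posterior inference used in the model):

```python
import numpy as np

def crp_sample(n, alpha, rng):
    """Draw one partition of n items from a Chinese Restaurant Process."""
    assignments, counts = [0], [1]          # first item opens the first table
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)  # existing table or a new one
        if table == len(counts):
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

rng = np.random.default_rng(42)
assignments, counts = crp_sample(20, alpha=1.0, rng=rng)
print(len(counts), counts)
```

The number of clusters is not fixed in advance but grows (logarithmically in expectation) with the number of items, which is what lets the model learn the number of layer clusters and node communities from data.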

Shuai Li, Baoxiang Wang, Shengyu Zhang and Wei Chen. Contextual Combinatorial Cascading Bandits
Abstract: The multi-armed bandit problem is usually formulated as a system of K base arms whose rewards are random samples from unknown distributions with unknown means. The learning agent pulls one arm every time step and tries to minimize the cumulative regret, which is the difference in cumulative rewards between always pulling the arm that is best in expectation and playing according to the agent's strategy.

Recently, stochastic combinatorial bandits have started to draw much attention. At every time step, a learning agent chooses a subset of ground items under certain combinatorial constraints. There are several different kinds of feedback: (1) bandit feedback, where the learning agent can only obtain the reward of the chosen super arm; (2) semi-bandit feedback, where the learning agent can also obtain the stochastic outcomes of all the base arms constituting the chosen super arm; (3) cascading feedback, where the learning agent can obtain the reward of the chosen super arm and the weights of some base arms in the chosen super arm, according to some problem-specific stopping criterion for observation.

The cascading feedback model fits many real application scenarios. For example, in online or mobile recommendation, it is typical practice to recommend an ordered list to a user, who usually goes over the list in the recommended order and clicks through on one item of interest. Thus it is reasonable to assume that items before the clicked item are of no interest to the user, while the user's interest in items after the clicked item is unclear.

We propose the contextual combinatorial cascading bandits and generalize existing works in several directions to cover more realistic scenarios.

First, we incorporate contextual information into cascading bandits. In online recommendation, contextual information includes various user and item information, and user behaviors in different contexts are different. Therefore, utilizing contextual information is crucial for personalized recommendation. In network routing, temporal and spatial contextual information may also be useful to determine better routing paths. Therefore, we incorporate contextual information into the cascading bandit formulation. Second, previous papers treated all positions in the cascading list equally, but in applications different positions may bring different rewards. For example, in online recommendation, we usually prefer users to find their interested item in the list as early as possible to increase user satisfaction. To model these preferences, we introduce position discounts to the cascading bandit, and we show through experiments that incorporating position discounts may significantly improve the learning result. Finally, we generalize the reward functions to more general non-linear reward functions satisfying monotonicity and Lipschitz continuity conditions. This generalization allows us to cover new realistic scenarios.

We design a UCB-type algorithm, C^3-UCB, for this problem, prove an n-step regret bound \tilde{O}(\sqrt{n}) in the general setting, and give finer analysis for two special cases. Experiments on synthetic and real datasets demonstrate the advantage of involving contextual information and position discounts.
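The contextual UCB idea underlying algorithms of this family can be sketched in a few lines. This is a generic LinUCB-style score (ridge estimate of the reward parameter plus a confidence bonus per context vector), a simplification rather than the C^3-UCB algorithm itself; the dimensions, data, and confidence width are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 50
X = rng.normal(size=(n, d))                     # past context vectors
theta_true = np.array([1.0, -0.5, 0.0, 0.2])    # unknown reward parameter
r = X @ theta_true + 0.1 * rng.normal(size=n)   # observed rewards

A = np.eye(d) + X.T @ X                         # ridge Gram matrix
theta_hat = np.linalg.solve(A, X.T @ r)         # ridge estimate of theta
alpha = 1.0                                     # confidence width (invented)

x_new = rng.normal(size=d)                      # candidate item's context
# UCB score = estimated reward + exploration bonus for this context.
ucb = x_new @ theta_hat + alpha * np.sqrt(x_new @ np.linalg.solve(A, x_new))
print(round(float(ucb), 3))
```

The agent would compute this score for every candidate base arm and assemble the super arm from the highest-scoring items, subject to the combinatorial constraints.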

Note: This work is presented at ICML 2016 as a full research paper.
Chelsea Finn and Sergey Levine. Deep Visual Foresight for Planning Robot Motion
Abstract: Most standard robotic manipulation systems consist of a series of modular components for perception and prediction that can be used to plan actions. Consider the example of screwing a cap onto a bottle. This might involve first segmenting an observed point cloud into objects, fitting a 3D model to each object segment, executing a physics simulator using the estimated physical properties of objects and the environment, and finally, choosing the actions which screw on the cap successfully. Each component in this complex pipeline is brittle to small changes in the environment, and particularly sensitive to errors early in the process, for example, in estimating the mass of the cap or the frictional coefficient of the bottle’s ridges. Learning-based methods have shown remarkable effectiveness in handling complex, unstructured environments in passive computer vision tasks, but limited success thus far in robotic systems. Motivated by large-scale, unsupervised robotic learning, we consider the question of whether it is possible to replace the hand-engineered robotic manipulation pipeline with a single general-purpose, learned model that connects low-level perception with physical prediction.

In this work, we show that, with deep recurrent models and large quantities of robot experience, robots can indeed learn general-purpose predictive models from raw, sensory observations, and can then use those models to manipulate objects, all with minimal human involvement. We collected a dataset of 50,000 robotic pushing attempts involving hundreds of objects, by recording the experiences from 10 robots operating in parallel. Human involvement only consisted of specifying motor primitives, occasionally swapping out old objects for new ones, and handling mechanical issues. We then trained a deep neural network to predict future video conditioned on the actions taken by the robot, akin to a visual foresight of the future. Our novel video prediction model uses convolutional LSTMs and dynamic neural advection (DNA) to efficiently predict the pixel transformations in the image at each timestep, rather than predicting future pixels directly [1]. As a result, it can reuse object appearance information from previous frames, allowing the model to generalize to previously-unseen objects.

Using the learned predictive model, a robot can plan to maneuver objects based on probabilistic inference through the model [2]. We define the task as moving a pixel or group of pixels to goal positions, allowing a user to specify how objects should move. At test time, our method plans for the sequence of actions to move pixels as desired. Our results show that the robot can push objects into arbitrary desired positions, even for objects previously not seen at training time. Unlike most methods for robotic learning, our approach can learn in an entirely self-supervised fashion, without a detailed reward function, and to our knowledge, is the first method to demonstrate generalization to controlling novel objects, using model-based control on raw images.

[1] Finn, Chelsea, Goodfellow, Ian, and Levine, Sergey. Unsupervised Learning for Physical Interaction through Video Prediction. NIPS 2016.
[2] Finn, Chelsea and Levine, Sergey. Deep Visual Foresight for Planning Robot Motion. Under Review, arXiv 1610.00696.
Serena Yeung. Learning policies for training visual classifiers from noisy web data
Abstract: Large-scale manually annotated data has proven instrumental in scaling up visual recognition. However, dataset construction remains a very costly and inefficient process. Systems such as OPTIMOL, NEIL and LEVAN attempt to use noisy “webly-supervised” data from search engines to train computer vision models. However, these systems rely on iterative hand-tuned data selection strategies. Instead, we propose a Q-learning formulation for selecting the right examples for training a classifier from noisy web search results. Our method learns policies on a small labeled training dataset and can then be applied to automatically train classifiers for new visual concepts directly from web data. Experiments on MNIST and the challenging Sports-1M benchmarks demonstrate that our method is able to learn accurate visual concept classifiers on both image and video data.
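The core Q-learning update behind such an example-selection policy can be sketched in miniature. This toy version is an assumption-laden simplification: a single state, two actions (keep or discard a candidate web example), and simulated rewards in place of real classifier-accuracy changes:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.zeros(2)            # Q[0] = discard, Q[1] = keep
eps, lr = 0.1, 0.1         # epsilon-greedy exploration, learning rate
mean_reward = [0.0, 0.5]   # keeping a (mostly clean) example helps on average

for _ in range(2000):
    # Epsilon-greedy action selection over the two actions.
    a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q))
    reward = rng.normal(mean_reward[a], 0.1)   # simulated, not real accuracy
    Q[a] += lr * (reward - Q[a])               # one-state Q-learning update
print(Q.round(2))
```

After training, the learned values favor keeping examples, i.e. the policy has discovered which selection action improves the (simulated) classifier.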
Elena Baralis, Tania Cerquitelli, Silvia Chiusano and Evelina Di Corso. Towards Self-Learning Data Transformation
Abstract: Large volumes of data are being collected at an ever increasing rate in various modern applications, ranging from social networks to scientific computing and smart environments. Since they are generated by a large variety of events, real datasets are usually characterized by an inherent sparseness. Furthermore, the features used to model real things/objects and human actions may have very large domains and variable distributions. The variability in data distribution increases with data volume, thus increasing the complexity of data analytics. Data-driven analysis is a multi-step process, in which data scientists tackle the complex task of configuring the analytics system to transform data into actionable knowledge. A plethora of algorithms is available for performing a given data analysis phase (e.g., data transformation), but algorithm selection is usually tailored to the data under analysis.
In many analytics processes tailored to sparse data collections, like collections of documents and medical treatment collections, suitable transformations of input data need to be explored to gain insights from data, reduce sparseness, and make the overall analysis problem more tractable. Furthermore, different weighting functions (e.g., term/item frequencies, GF-IDF) can be exploited to highlight the relevance of specific objects in the collection. However, different methods exist and the selection of the optimal ones is guided by the domain expert.

In this work we argue towards a new self-learning engine able to suggest to the analyst good transformation methods and weighting schemas for a given data collection which yield higher quality knowledge. This new generation of systems, named SELF-DATA (SELF-learning DAta TrAnsformation), relies on:
(i) an engine capable of characterizing data distributions through various indices (e.g., hapax legomena, Guiraud’s index of lexical richness), exploring different data weighting strategies (e.g., normalized term frequencies, logarithmic entropy) and data transformation methods (e.g., PCA, LSI) before applying a given data mining algorithm (e.g., cluster analysis), evaluating and comparing solutions through different quality indices (e.g., WSSSE, rand index, f-measure, precision, recall),
(ii) a knowledge database storing results of experiments on previously processed datasets, including data characterization and the selected results,
(iii) a classification algorithm trained on the knowledge base content to forecast the best methods for future analyses.

We implemented a preliminary version of SELF-DATA on the Apache Spark framework, supporting parallel and scalable processing of the different data transformation activities. It characterizes the data distribution through different quality indices and performs all the tests combining a given weighting strategy with a data transformation method before applying cluster analysis through K-means. The identified solutions are compared and ranked in terms of the quality of the extracted knowledge (i.e., the quality of the discovered clusters). For each analyzed dataset, the top-2 solutions are selected and stored in the knowledge base. A preliminary validation performed on 10 collections of news articles highlights that the term frequency and logarithmic entropy weighting methods better measure item relevance on very sparse datasets, and that the PCA method outperforms LSI in the presence of a larger data domain.
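As a toy illustration of two of the weighting schemes the engine explores, the sketch below applies normalized term frequency and a common log-entropy formulation to a small made-up document-term count matrix (rows = documents, columns = terms); the exact formulas SELF-DATA uses may differ.

```python
import numpy as np

counts = np.array([[2, 0, 1],
                   [0, 3, 1],
                   [1, 1, 0]], dtype=float)

# Normalized term frequency: each document row scaled to sum to 1.
tf = counts / counts.sum(axis=1, keepdims=True)

# Log-entropy (LSI convention): local weight log(1 + tf_ij), global weight
# g_j = 1 + sum_i p_ij log(p_ij) / log(n_docs), with p_ij = count_ij / gf_j.
n_docs = counts.shape[0]
gf = counts.sum(axis=0)                      # global frequency per term
p = np.where(counts > 0, counts / gf, 0.0)
plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
g = 1.0 + plogp.sum(axis=0) / np.log(n_docs)
log_entropy = np.log1p(counts) * g           # combined local * global weight
```

Terms concentrated in few documents get a global weight near 1, evenly spread terms near 0, which is why log-entropy helps on sparse collections.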
Gundula Povysil, Djork-Arné Clevert and Sepp Hochreiter. HapRFN: a deep learning method for identifying short IBD segments
Abstract: A segment of DNA is called identical by descent (IBD) in two or more individuals if it is identical because it was inherited from a common ancestor. IBD segments can be used to uncover hidden familial relationships, detect the population of origin of an individual or find interbreedings between humans and ancient hominins like the Neandertal. IBD segments can also be used to find the cause of diseases via a technique called IBD mapping.

For the above applications HapFABIA was shown to be superior to other IBD detection methods by detecting short IBD segments that are tagged by rare DNA variants via biclustering. Nevertheless, HapFABIA still has several problems: (1) Deciding whether individuals possess an IBD segment is often difficult because of the soft bicluster membership supplied by HapFABIA. (2) HapFABIA can only extract 10-30 IBD segments at once and therefore needs to perform multiple iterations. However, the IBD segments identified in different iterations may not be decorrelated, thus they may be redundant and overlapping, or even split into smaller segments. (3) Processing very large data sets is time-intensive.

We recently introduced “Rectified Factor Networks” (RFNs) as an unsupervised deep learning approach. Each code unit of the RFN represents a bicluster and therefore an IBD segment, where samples for which the code unit is active share the bicluster (IBD segment) and features (DNA variants) that have activating weights to the code unit tag the IBD segment. HapRFN overcomes the problems of HapFABIA. (1) RFNs provide sparser codes via their rectified linear units that immediately supply bicluster memberships as factors being different from zero. (2) RFNs can learn thousands of factors and therefore many IBD segments simultaneously. Consequently, all IBD segments are mutually decorrelated, thus are not redundant and do not overlap. (3) RFNs allow for much faster processing of very large data sets using techniques from deep learning like efficient matrix multiplications and implementations of networks on graphical processing units (GPUs).

To keep feature membership vectors sparse, we introduce a Laplace prior on the parameters. Therefore, only a few features contribute to activating a code unit, that is, only a few features belong to a bicluster. In order to enforce more sparseness of the sample membership vectors, we introduce dropout of code units. Dropout means that during training some code units are set to zero at the same time as they get rectified. Dropout avoids co-adaptation of code units and reduces correlation of code units.

As a result HapRFN makes it possible to process very large data sets and to determine the size and number of IBD segments more precisely. With HapRFN we are able to accurately detect familial relationships, populations of origin, or interbreeding with ancient genomes in data sets with thousands of individuals. Furthermore, finding disease associations via IBD mapping becomes more reliable, which might be the key to uncovering unknown hereditary causes of multifactorial diseases.
Gulce Kale and Oznur Tastan. Early Diagnosis of Cancer from Volatile Organic Compounds
Sakinat Folorunso and Adesesan Adeyemo. Alleviating Classification Problem of Imbalanced Dataset
Abstract: The class imbalance problem occurs when there are many more instances of some classes than of others, i.e., a skewed class distribution. In such cases, standard classifiers tend to be overwhelmed by the majority class and to ignore the minority class. It is one of the 10 challenging problems of data mining research and pattern recognition. An imbalanced dataset degrades the performance of a classifier, as accuracy becomes biased towards the majority class. Several techniques have been proposed to solve this problem. This paper aims to improve the true positive rate (detection) of the minority class (GDM), which is the class of interest. This study applies two under-sampling techniques reported in the literature, under-sampling the majority class to balance the dataset before classification. These under-sampling schemes were evaluated on three learning algorithms (decision trees, both pruned and unpruned, and RIPPER) using Matthews Correlation Coefficient (MCC) and Kappa statistics as metrics. The under-sampling techniques were assessed in the medical domain on a real-life dataset of 886 instances of patients with diabetes mellitus. The diagnosis fell into three classes with the following distribution: TYPE1 with 62 instances, TYPE2 with 807 instances, and Gestational Diabetes Mellitus (GDM), the class of interest, with only 17 instances. This study revealed that, compared with the original dataset and the RUS dataset, the NCL dataset achieves a better true positive rate for the minority class, as well as higher MCC and Kappa statistics, with all three learning algorithms.
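As a minimal sketch of one of the two techniques, the code below performs random under-sampling (RUS) of the majority class on a synthetic skewed binary problem and defines MCC as the evaluation metric; NCL (Neighbourhood Cleaning Rule) is a guided variant that instead removes majority examples near the class boundary. The data and class ratio are illustrative stand-ins, not the diabetes dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

y = np.array([0] * 95 + [1] * 5)          # 95:5 imbalance, class 1 = minority
X = np.arange(len(y)).reshape(-1, 1)      # dummy features

# Random under-sampling: keep all minority instances, sample an equal
# number of majority instances without replacement.
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
keep = rng.choice(majority, size=len(minority), replace=False)
idx = np.sort(np.concatenate([keep, minority]))
X_bal, y_bal = X[idx], y[idx]             # balanced 5:5 dataset

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels in {0, 1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC is 1 for perfect prediction, 0 for random guessing, and is not inflated by majority-class accuracy, which is why it suits imbalanced problems.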
Mina Rezaei, Haojin Yang and Christoph Meinel. Brain tumor detection by deep convolutional neural network
Abstract: During the past years, deep learning has attracted huge attention by showing promising results in speech recognition, handwritten character recognition, image classification, detection and segmentation, etc. There are expectations that deep learning will improve medical image analysis applications, such as computer-aided diagnosis, image registration and multimodal image analysis, image segmentation and retrieval. There have been some applications of deep learning in medicine, such as cell tracking and organ cancer detection. Doctors use magnetic resonance images as effective tools to diagnose diseases. Computer-aided medical diagnosis can be fast and objective, with high robustness and reliability, to support the health system. The brain is a particularly complex structure, and analysis of brain MR images is an important step for many patients.
We propose an approach for classifying tumors as high- or low-grade glioma. Our network is based on Fast R-CNN [2], but we make it six times faster. The model introduces multiple built-in subnetworks which detect tumor types together with their location(s). Our network takes multimodal MR images as input during training and testing. We treat T1-contrast, FLAIR, and T2 as the red, green and blue channels and input them at once, making the network three times faster. We likewise map the sagittal, axial and coronal planes of the volume of interest to the red, green and blue channels, considering volumes of interest across planes as in Roth et al. [4]. Between the combination of modalities and the combination of planes, the network becomes six times faster overall. We use the VGG16 model [3] to initialize our network, as is done in the most recent state-of-the-art methods. The first seven convolutional layers and three max pooling layers of the VGG16 network are used as shared convolutional layers before the two sub-networks, producing feature maps from the entire input image. Following Fast R-CNN, the last max pooling layer of the VGG16 network is replaced with a region-of-interest pooling layer that pools the feature maps of each object proposal into a fixed resolution, which we set to w_i/9 × h_i/9 as in SPP-net. The final fully connected layer and softmax are replaced with two sibling fully connected layers. For evaluation we apply our model to the BRATS-2015 dataset, which contains 220 subjects with high-grade and 54 subjects with low-grade tumors. We achieved a Dice score of 0.72, a sensitivity of 0.89, and 94.3% accuracy for whole-tumor detection.
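The input packing described above can be sketched in a few lines: three MR modalities (random arrays here, standing in for T1-contrast, FLAIR and T2 slices) are stacked as the R, G, B channels of a single image, so one forward pass sees all modalities at once.

```python
import numpy as np

h, w = 240, 240  # BRATS slices are 240 x 240; values here are random stand-ins
t1c = np.random.rand(h, w).astype(np.float32)
flair = np.random.rand(h, w).astype(np.float32)
t2 = np.random.rand(h, w).astype(np.float32)

# Channels-last stacking: one (H, W, 3) "RGB" image per slice.
rgb = np.stack([t1c, flair, t2], axis=-1)
```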
[1] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: convolutional architecture for fast feature embedding. In ACMMM, pages 675–678. ACM, 2014
[2] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
[3] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014
[4] H.R. Roth, L. Lu, A. Seff, K.M. Cherry, J. Hoffman, S. Wang, J. Liu, E. Turkbey. A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations. In MICCAI, 2014.

Duygu Ozcelik and Oznur Tastan. A Weakly Supervised Clustering Method for Cancer Subgroup Identification
Stephanie Hyland and Gunnar Ratsch. Learning unitary operators with help from u(n)
Abstract: Training deep and recurrent neural networks (RNNs) can be challenging due to exploding or vanishing gradients. This is especially problematic for modelling data with very long term dependencies. The popular long short-term memory (LSTM) recurrent neural network was originally proposed to better retain information and circumvent gradient instability issues. An alternate approach under recent consideration is to use unitary or orthogonal operators for the transition matrix of the RNN. Such operators have a bounded eigenvalue spectrum, ensuring gradients cannot explode. However, using such operators introduces an issue for traditional gradient-based learning, as additive updates in general do not preserve orthogonality (or unitarity). To address this, we describe a parametrisation of unitary matrices based on the Lie group-Lie algebra correspondence. Each unitary matrix (an element of the Lie group of unitary matrices) can be identified with at least one element of the corresponding Lie algebra. This algebra forms a vector space (it is the tangent space of the group at the identity) and is therefore closed under addition. We use the coefficients of the corresponding element of the algebra (relative to a basis of the algebra) as the parameters of the unitary matrix, and can use traditional (stochastic) gradient descent for optimisation, retaining the unitarity of the matrix at all times.
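The idea can be sketched as follows: real parameters define a Hermitian matrix H, so iH lies in the Lie algebra u(n), and U = exp(iH) is always unitary no matter how the parameters are updated additively. The basis and shapes below are illustrative, not the paper's exact construction; the exponential is computed via the eigendecomposition of H.

```python
import numpy as np

def unitary_from_params(theta, n):
    """Map n*n real parameters to an n x n unitary matrix exp(iH)."""
    m = theta.reshape(n, n)
    # Symmetric real part plus antisymmetric imaginary part gives a
    # Hermitian H, so iH is skew-Hermitian (an element of u(n)).
    H = (m + m.T) / 2 + 1j * (m - m.T) / 2
    w, V = np.linalg.eigh(H)                   # H = V diag(w) V^H, w real
    return (V * np.exp(1j * w)) @ V.conj().T   # exp(iH), exactly unitary

n = 4
theta = np.random.randn(n * n)  # any real parameter vector works
U = unitary_from_params(theta, n)
```

Gradient steps are taken on `theta`; re-applying the map after each step yields a unitary matrix by construction, with no re-projection needed.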

We have studied the use of this parametrisation on a simple supervised learning problem, where vectors and their matrix-transformed counterparts are provided as examples and targets. We find that our parametrisation recovers the ground truth unitary matrix with low loss, outperforming another recently-proposed (restricted) parametrisation of unitary matrices, as well as an approach of re-projecting to a unitary matrix after each batch. We also describe a mathematical trick to calculate gradients of our parametrisation. Work to integrate this parametrisation into an RNN is underway.

A version of this work is available as a preprint: https://arxiv.org/abs/1607.04903
Vanessa Gómez-Verdejo, Sergio Muñoz-Romero and Jerónimo Arenas-García. Regularized Multivariate Analysis Framework for Interpretable High-Dimensional Variable Selection
Abstract: Multivariate Analysis (MVA) comprises a family of well-known feature extraction methods that exploit correlations among the input variables representing the data; well-known examples include Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA), and Orthonormalized PLS (OPLS). One important property enjoyed by most such methods is the uncorrelation of the extracted features, which gives the feature extraction process additional advantages: (1) the relevance of each extracted feature is directly given by the magnitude of its associated eigenvalue, simplifying the selection of a reduced subset of features; (2) subsequent learning tasks are simplified, most notably when inversion of the covariance matrix is required.
Recently, regularized versions of MVA methods have appeared in the literature, mainly with the goal of gaining interpretability of the solution. Some of the most significant contributions in this direction are sparse PCA, sparse OPLS, Sparse Reduced Rank Regression (SRRR), and l2,1-regularized CCA (or L21SDA).
All these approaches are based on an iterative process that combines the optimization of two coupled least-squares problems, one of them subject to a minimization constraint. Since the sparse PCA proposal, this constrained least-squares minimization has typically been treated as an orthogonal Procrustes problem.
A first objective of this paper is to highlight and make the computational intelligence community aware of some limitations derived from the use of orthogonal Procrustes in the context of regularized MVA methods. As explained in [1], these methods:
1) do not converge to their associated non-regularized MVA solutions when the penalty term is removed,
2) are highly dependent on initialization, and may even fail to progress towards a solution,
3) do not in general obtain uncorrelated features.
As a solution to these problems, [1] proposes an alternative optimization procedure that avoids the Procrustes solution. Starting from the framework presented in [1], a second contribution of this work consists in introducing a novel MVA method that uses the l2,1 norm as the regularization term. In this way, the projection vectors are encouraged to include zeros in some of their components, making it easier to understand how the new features are built and thus gaining interpretability. Moreover, this norm has the property of rotational invariance, a fact that we exploit to significantly reduce the computational cost of the training phase. Although some authors have already adapted the robust variable selection method to the MVA scenario, these adaptations are based on orthogonal Procrustes and do not exploit the rotational invariance of the l2,1 norm, incurring unnecessary extra computational burden.
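The two properties used above are easy to see concretely: the l2,1 norm of a projection matrix W is the sum of the Euclidean norms of its rows, so penalizing it drives whole rows to zero (the corresponding input variable drops out of every extracted feature at once), and right-multiplying by an orthogonal matrix leaves every row norm, hence the norm itself, unchanged. W and the rotation R below are toy values, not from the paper.

```python
import numpy as np

def l21_norm(W):
    """Sum of the l2 norms of the rows of W."""
    return np.sum(np.sqrt(np.sum(W ** 2, axis=1)))

W = np.array([[0.0, 0.0],      # zero row: this variable is never used
              [3.0, 4.0],
              [1.0, 0.0]])

# An orthogonal (rotation) matrix: l21_norm(W @ R) == l21_norm(W).
c, s = np.cos(0.7), np.sin(0.7)
R = np.array([[c, -s], [s, c]])
```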
Experimental results on high-dimensional problems show that the methods included in this MVA framework are not only computationally more efficient than previous state-of-the-art solutions, but can also improve their performance.

This work has recently been accepted for publication in the IEEE Computational Intelligence Magazine.

This work has been supported by MICYT project TEC2014-52289-R.

[1] S. Muñoz-Romero, V. Gómez-Verdejo, and J. Arenas-García, “Why (and how) avoid orthogonal procrustes in regularized multivariate analysis,” arXiv preprint, arXiv:submit/1555588, 2016.

Farah Colchester, Heloise Marais, Patrick Thomson and David Clifton. Early warning prediction system for handpump failure
Abstract: Groundwater is the main source of water for many people in rural sub-Saharan Africa. The water is often accessed using handpumps on top of sealed boreholes, as this avoids contamination. Unfortunately, it is estimated that one third of handpumps are out of use at any given time, which forces people to travel further for water or to use water sources that are less safe to drink.
A data-driven maintenance system, whereby handpumps were assigned to a mechanic for repair, led to a ten-fold decrease in pump down-time, to two days. However, even short-term consumption of contaminated water can have significant adverse health impacts.

Predictive condition monitoring is an engineering approach used to detect infrastructure damage as early as possible and, ideally, before any downtime of the system occurs. The approach is widely used in engineering, from commercial and military jet engines [1] to patient monitoring in health systems [2,3], and can extend directly to monitoring the condition of handpumps in rural villages; however, no such system has yet been applied to handpumps.

We embed high-frequency accelerometers into the handles of the pumps to collect pumping and vibration data. We apply a low-pass filter to the accelerometry time-series to remove the effects of the pumping motion so as to focus on the vibrations of the handpump. We divide the resulting time-series into short overlapping windows and apply the fast-Fourier transform to each. This gives information on the frequency components of the time series, from which feature vectors are extracted. As the dominant frequency components vary between pumps, we present methods for automatically detecting the relevant features for each pump.

Using feature vectors from a pump in working mode, we build a model of ‘normality’ that encapsulates the behaviour when the pump is working well. This is done using one-class support vector machines (SVMs), which define a boundary around the space of normal data against which new data can be compared. Data that falls outside the boundary is possibly abnormal, and its distance from the boundary can be used as a novelty score. Extreme value theory can then be used to determine whether a window of data with high novelty scores is what would be expected by chance in a working pump, or whether it indicates novel behaviour, i.e. a failure of the pump.
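A much-simplified sketch of the feature pipeline described above: slide overlapping windows over a (synthetic) vibration signal, take the FFT magnitude of each window as a feature vector, then score new windows by distance from the mean "normal" spectrum. The real system uses one-class SVMs and extreme value theory; the mean-distance score and all numbers below are illustrative stand-ins.

```python
import numpy as np

fs = 100                                     # assumed sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.1 * np.random.randn(len(t))

# Short overlapping windows -> FFT magnitude spectra as feature vectors.
win, hop = 100, 50
spectra = np.array([np.abs(np.fft.rfft(signal[i:i + win]))
                    for i in range(0, len(signal) - win + 1, hop)])

# Crude model of 'normality': the mean spectrum; novelty = distance to it.
mu = spectra.mean(axis=0)
novelty = np.linalg.norm(spectra - mu, axis=1)
```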

Initial proof-of-concept results on datasets collected from a number of pumps in Kenya show that these methods detect changes in the handpumps. Further work is required to ascertain sensitivity and specificity levels in detecting abnormal behaviour. Ultimately, we aim to roll out a low-cost accelerometry and messaging system which can be installed in a handpump, with condition monitoring algorithms that would analyse vibration spectra to alert authorities to a problem before the pump goes out of use. This technology could greatly improve people’s access to safe drinking water in many parts of the developing world, leading to improved health and development outcomes.

For references, please see uploaded PDF
Wiebke Koepp, Sebastian Urban and Patrick van der Smagt. A Novel Transfer Function for Continuous Interpolation between Summation and Multiplication in Neural Networks
Abstract: In this work, we present the implementation and evaluation of a novel adaptive transfer function for use in artificial neural networks. It allows a continuous change between summation and multiplication as the operation performed by a neuron. While the output of a conventional neuron is computed as a weighted sum of its inputs, multiplicative neurons involve a product of the inputs with the weights included as powers. Utilizing logarithmic identities and continuously differentiable fractional iterates of the exponential function yields an adaptive transfer function with one additional parameter per neuron and layer. This parameter can be determined along with weights and biases during standard, gradient-based training and therefore eliminates the need for a deliberate decision regarding the operation performed by each neuron prior to learning.
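The two endpoints of the interpolation can be written so the multiplicative neuron reuses the additive one through the identity prod(x_i^w_i) = exp(sum(w_i * log x_i)); the fractional iterates of exp proposed in the work continuously connect these two cases. Inputs are assumed positive here for the logarithm; values are illustrative.

```python
import numpy as np

def additive(x, w):
    """Conventional neuron: weighted sum of inputs."""
    return np.dot(w, x)

def multiplicative(x, w):
    """Multiplicative neuron: product of inputs raised to the weights,
    computed via the logarithmic identity exp(sum(w * log x))."""
    return np.exp(np.dot(w, np.log(x)))

x = np.array([2.0, 4.0])
w = np.array([1.0, 0.5])
```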

Building on the idea first introduced in [1], one contribution of this work is the implementation of two different approaches to fractional iteration of the exponential function within the widely used machine learning library Theano. We also address the trade-off between accuracy and computational efficiency in computationally intense operations through both error analysis and interpolation of the involved functions. Furthermore, the proposed transfer function is evaluated by comparing its performance to conventional transfer functions on various regression problems. Here, we mainly focus on datasets that incorporate multiplicative interactions between inputs, as we expect these to benefit most from the novel transfer function. Indeed, interpolation between summation and multiplication achieves comparable or even slightly better results, outperforming conventional transfer functions on a task designed to assess long-range generalization performance. However, this comes at the cost of increased training time.

Further investigation of the proposed transfer function is necessary to issue a definitive judgement on its usefulness. Future work includes its evaluation on other datasets and investigation of regularization approaches specifically for the introduced parameter as well as the comparison to another recently proposed transfer function [2] with a similar objective.

Note: A full text version of this work is available [3], parts of which have been presented within the Workshop track of ICLR 2016 [4]

[1] S. Urban and P. van der Smagt (2015). “A Neural Transfer Function for a Smooth and Differentiable Transition Between Additive and Multiplicative Interactions”. arXiv: 1503.05724 [stat.ML].
[2] L. B. Godfrey and M. S. Gashler (2015). “A Continuum among Logarithmic, Linear, and Exponential Functions, and Its Potential to Improve Generalization in Neural Networks”. arXiv: 1602.01321 [cs.NE].
[3] W. Koepp (2015). "A Novel Transfer Function for Continuous Interpolation between Summation and Multiplication in Neural Networks". Master's Thesis. Stockholm, Sweden. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-184087
[4] W. Koepp, P. van der Smagt and S. Urban (2016). "A Differentiable Transition Between Additive and Multiplicative Neurons". arXiv: 1604.03736 [cs.LG].
Niranjani Prasad, Barbara Engelhardt, Li-Fang Cheng, Corey Chivers, Michael Draugelis and Kai Li. A Reinforcement Learning Approach to Weaning of Mechanical Ventilation in ICU
Abstract: The majority of patients admitted to intensive care units undergo mechanical ventilation [Kirton, O. (2011)], and are administered intravenous sedatives while ventilated in order to maintain comfort and physiological stability. For these patients, 40% of the time spent on the ventilator is dedicated to weaning, but opinion on the best protocol varies among clinicians. Minimizing the duration of ventilation can greatly improve patient outcomes and reduce hospital costs, while premature weaning, which may lead to reintubation, is associated with a seven-fold increase in mortality rate.

In this work, we propose a reinforcement learning approach to finding an optimal, personalized policy for the control of sedation and weaning of ventilated patients over the duration of the ICU stay. We frame this sequential decision making problem as a Markov decision process (MDP); the state of the process at each time step is defined by the patient's demographic characteristics and a combination of observations of informative physiological measures and oxygenation parameters that signal a patient's readiness for extubation. At each time step, an action is chosen from a discrete action list: intubation, extubation, administration of one of a range of sedatives at different dosages, or of a spontaneous breathing trial. The reward function for the MDP is designed with the objective of maintaining vitals and oxygenation parameters within desired ranges and minimizing the total duration of ventilation, while heavily penalizing reintubation. We incorporate prior information in the form of rules-of-thumb used by clinicians when refining the features of the state space and in reward shaping, to facilitate the learning of an optimal policy.

Fitted Q Iteration (FQI) [Ernst, D. et al (2005)] is a form of batch-mode reinforcement learning that works by solving a sequence of supervised learning problems - each with the full set of one-step transitions - to learn a mapping from state-action transitions to expected rewards, and recursively updating the Q-function estimate. The algorithm is therefore able to more efficiently use limited logged data. FQI with extremely randomized trees, a tree-based ensemble method for supervised regression problems, is used here as the function approximator as it has been shown to be effective in modeling large, continuous Q-functions in clinical settings. [Ernst, D. et al (2006); Escandell-Montero, P. et al (2014)]
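A bare-bones FQI loop on a synthetic batch of one-step transitions (s, a, r, s') can be sketched as below. The paper uses extremely randomized trees as the regressor; here a 1-nearest-neighbour lookup stands in so the sketch stays dependency-free, and all states, actions and rewards are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions, gamma = 200, 3, 0.9
S = rng.random((n, 2))                     # states observed
A = rng.integers(0, n_actions, n)          # actions taken
R = rng.random(n)                          # rewards received
S2 = rng.random((n, 2))                    # next states

def fit_predict(X, y, Xq):
    """1-NN regression: for each query row, return y of the closest X row."""
    d = np.linalg.norm(Xq[:, None, :] - X[None, :, :], axis=2)
    return y[d.argmin(axis=1)]

X = np.hstack([S, A[:, None]])             # regress Q on (state, action)
y = R.copy()                               # first target: immediate reward
for _ in range(10):                        # each pass is one supervised fit
    # Evaluate the current Q estimate at (s', a) for every action a,
    # then rebuild the regression targets with the Bellman backup.
    q_next = np.column_stack([
        fit_predict(X, y, np.hstack([S2, np.full((n, 1), a)]))
        for a in range(n_actions)])
    y = R + gamma * q_next.max(axis=1)
```

The full batch of transitions is reused at every iteration, which is why FQI makes efficient use of limited logged data.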

Experiments were run on a subset of 300 pneumonia patients from the freely available MIMIC III critical care database. Chart measurements were interpolated and resampled into 1-hour intervals, yielding a set of approximately 24,000 one-step transitions. Initial evaluation of the policy learnt by FQI suggests an expected mean reward 20% greater than that from the original data. One of the challenges of policy evaluation in this setting is the left-censoring of the recorded data - it is unclear whether a patient would have been able to breathe independently prior to the true extubation time. We explore methods for clustering similar states across different patients, as well as scaling up the amount of training data, to tackle this through fuller exploration of the state-action space.
Nicole Mücke and Gilles Blanchard. Convergence rates of spectral methods for statistical inverse learning problems
Abstract: Consider an inverse problem of the form g = Af, where A is a known operator between Hilbert spaces of real-valued functions on a space X, and assume that we observe g at some randomly drawn points X1,...,Xn which are i.i.d. according to some distribution PX. Each observation is subject to a random independent noise, i.e. Yi = (Af)(Xi) + εi, i = 1,...,n. The goal is to recover the function f. Here it is assumed that for each point x the evaluation mapping f → Af(x) is continuous. This setting as well as its relation to random nonparametric regression and statistical learning with reproducing kernels has been studied more recently in a series of works by [2,3]. In particular, this setting is geometrically equivalent to the random nonparametric regression model Yi = h(Xi)+εi,i = 1,...,n, wherein it is assumed that h belongs to a reproducing kernel Hilbert space, and the goal is to recover the function h with the estimation error being measured in HK-norm (as opposed to the L2(PX) norm for the standard least squares regression setting). We consider the estimation of f from the observations by spectral methods. Our results concern convergence rates under Hölder source conditions and polynomial decay condition of eigenvalues, both with respect to an appropriate integral operator defined from A and the marginal distribution PX. These results extend previously known ones, in particular the minimax optimal rate for the convergence in H1-norm (or equivalently in HK-norm for the equivalent random design regression model) as well as a corresponding minimax lower bound, both of which had not been established for this setting (see [1]). We also extend our results to distributed learning.

[1] G. Blanchard, N. Mücke, Optimal rates for regularization of statistical inverse learning problems, arXiv Preprint (1604.04054), 2016
[2] A. Caponnetto, Optimal rates for regularization operators in learning theory, Technical report MIT- CSAIL-TR-2006-062, Massachusetts Institute of Technology (2006).
[3] F. Bauer, S. Pereverzev, L. Rosasco, On regularization algorithms in learning theory, J. Complexity 23(1): 52-72 (2007)
Maria Francesca and David Gregg. Computing Convolution and Activation in the Frequency Domain
Abstract: Convolution, a common operator in convolution networks, reduces to point-wise multiplication in the frequency domain under several transforms, including the Fourier, Laplace and Z transforms. Previous work has found that, because of this, computing convolution in the frequency domain provides a significant speedup over traditional convolution network implementations. However, the performance increase comes at the expense of repeatedly computing the transform and its inverse in order to apply other network operations such as activation, pooling, and dropout. Significant strides have been made by other researchers, especially for spectral pooling, but activation remains an open issue. Multiple articles on the topic have cited the lack of an appropriate representation of many common activation functions in the frequency domain as a significant block to future work in this area.
We show mathematically how convolution and activation can both be implemented in the frequency domain. In order for the spectral representation of a function to exist, its integral from negative infinity to positive infinity must exist and not tend to infinity in the limit. This has been a problem for most activation functions, as their integrals do not meet these criteria. However, by considering the support of an activation function applied in practice to a finitely large and finitely ranged layer or image, it is possible to provide a spectral representation of these activation functions under many transforms. We mainly discuss the ramp function; however, the approach may be extended to any activation function with finitely many discontinuities that does not tend to infinity over a finite domain. This means that the approach covers all sigmoid functions.
By applying the activation function in the frequency domain, we replace, for each convolution layer, two transforms (which, in the case of the discrete Fourier transform, are O(n log n) operations) with a single function of similar complexity. Although the complexity is now identical to approaches which run only convolution in the frequency domain (already faster than conventional implementations), our approach also removes the requirement to swap between representations. This reduces the memory required to run spectral convolution and activation, as there is no longer any swapping between a real-valued representation of the network and a complex-valued one, except at boundaries defined by custom architectures for which spectral representations have not yet been defined. Our description of a spectral activation function, together with existing spectral analogs of other network functions, may then be used to compose a fully spectral implementation of a convolution network.
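The property the whole approach builds on is the discrete convolution theorem: circular convolution in the signal domain equals point-wise multiplication of DFTs. The quick check below verifies it numerically; it is a demonstration of the underlying identity, not the paper's spectral activation itself.

```python
import numpy as np

x = np.random.randn(64)
k = np.random.randn(64)

# Direct circular convolution: (x * k)[i] = sum_j x[j] k[(i - j) mod N].
direct = np.array([np.sum(x * np.roll(k[::-1], i + 1)) for i in range(64)])

# Convolution theorem: point-wise multiplication of the DFTs.
spectral = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real
```

Applying the activation to the spectral representation directly is what lets the inverse/forward transform pair between layers be dropped.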
Anastasia Pentina and Ruth Urner. Lifelong Learning with Weighted Majority Votes
Abstract: Better understanding of the potential benefits of information transfer and representation learning is an important step towards the goal of building intelligent systems that are able to persist in the world and learn over time. In this work, we consider a setting where the learner encounters a stream of tasks but is able to retain only limited information from each encountered task, such as a learned predictor. In contrast to most previous works analyzing this scenario, we do not make any distributional assumptions on the task generating process. Recent work of Balcan et al. (COLT’15) provides the first analysis of this setting for the case of learning linear predictors and their linear combinations under the assumption of marginal distributions of all the observed tasks being isotropic log-concave. In this work we show that substituting linear combinations of classifiers by weighted majority votes allows us to drop the log-concavity assumption, while retaining similar theoretical guarantees. In particular, we provide a lifelong learning algorithm with error guarantees for every observed task (rather than on average). We formulate a complexity measure that captures the diversity of the observed tasks and demonstrate that the proposed algorithm leads to sample complexity reductions in comparison to solving every task in isolation in terms of this measure. Further, our algorithmic framework can naturally be viewed as learning a representation from encountered tasks with a neural network.
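A weighted majority vote over previously learned binary predictors (labels in {-1, +1}) predicts sign(sum_k alpha_k h_k(x)). The predictors' outputs and weights below are illustrative stand-ins for the ones retained from earlier tasks.

```python
import numpy as np

def majority_vote(preds, alpha):
    """preds: (n_classifiers, n_points) array in {-1, +1};
    alpha: non-negative weights, one per retained classifier."""
    return np.sign(alpha @ preds)

preds = np.array([[+1, -1, +1],    # each row: one classifier's predictions
                  [+1, +1, -1],
                  [-1, +1, +1]])
alpha = np.array([0.6, 0.3, 0.2])  # e.g. trust in each past predictor
```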
Note: This work has been accepted for publication at NIPS 2016.
Cecília Nunes, Anders Jonsson, Oscar Camara and Bart Bijnens. A decision tree approach for imprecise data
Abstract: The development of effective prediction models depends on the quality of the data. Medical data is heterogeneous, noisy, incomplete and often has a limited number of examples. In clinical data, precision errors arise from distinct diagnostic practices, workstations with manufacturer-dependent technology, or the distinct modalities available to obtain a given insight.

Decision trees (DTs) are an interpretable learning algorithm with acknowledged benefits in various domains. In contrast to many recent machine learning algorithms, DTs are interpretable, which matters in safety-critical contexts such as medicine, where decision trees are already used in guidelines. However, DTs do not account for the variability of the input data: learning from uncertain data yields poor generalization, and evaluating noisy data leads to inaccurate predictions, because small variations around split values can drastically change the outcome.

Several methods have been proposed to make DTs robust to uncertainty by weighting the contribution of all child nodes to the predictions, with benefits in tree size and accuracy. A qualitative benefit of this approach is that all the child nodes contribute to the final decision, allowing the analyst to consider different decision paths. Two of the works with accuracy improvements considered Gaussian and uniform distributions for the uncertainty. They however entailed high computational costs, and one of them did not consider the uncertainty model during training. Another approach with reduced tree sizes employed a sigmoid model in a multivariate-split tree, therefore limiting interpretability.

In this work, we present a probabilistic DT, where the weights of the branches to the class prediction are determined by the noise distribution. The precision of each numerical variable is represented by a normal distribution. The method separates the uncertainty representation in the split search and in the evaluation phases to allow for investigation of their separate impact. When learning the split values, the search is done on a filtered version of the training data. When propagating the training and test instances, we employ a probabilistic approach that weights the contribution of all subtrees to the prediction. Moreover, the filtered search improves upon previous work by limiting the number of entropy computations, while maintaining a large search space.
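The soft propagation described above can be sketched as follows (a minimal illustration, not the authors' implementation; the `Node` structure, the per-feature `sigma`, and the convention that values at or below the threshold go left are all assumptions made here for the example):

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    feature: int = 0
    threshold: float = 0.0
    left: "Node" = None
    right: "Node" = None
    dist: dict = None  # class -> probability; set only at leaves

def soft_predict(node, x, sigma):
    """Propagate an instance through the tree, weighting both subtrees by
    the probability that Gaussian measurement noise around x[feature]
    would place the true value on either side of the split."""
    if node.dist is not None:  # leaf: return its class distribution
        return node.dist
    # P(true value <= threshold) under N(x[feature], sigma[feature]^2),
    # via the normal CDF expressed with the error function
    z = (node.threshold - x[node.feature]) / sigma[node.feature]
    p_left = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    left = soft_predict(node.left, x, sigma)
    right = soft_predict(node.right, x, sigma)
    classes = set(left) | set(right)
    return {c: p_left * left.get(c, 0.0) + (1.0 - p_left) * right.get(c, 0.0)
            for c in classes}
```

An instance lying exactly on a split value thus contributes equally to both subtrees instead of flipping its prediction under tiny perturbations.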

With the application to clinical data mining in mind, the merits of the method are evaluated with regard to prediction accuracy and tree size. Four clinical datasets and one synthetic dataset were used, and several degrees of noise were added. The main result concerns the reduction in tree size when using the filtered search or soft training propagation. Soft evaluation revealed increased robustness of accuracy to different noise levels.
Allison Chaney, Young-Suk Lee, Olga Troyanskaya, David Blei and Barbara Engelhardt. Generalized Nonparametric Deconvolution Models
Abstract: We consider the problem of modeling collections of convolved observations. Specifically, each feature of an observation is the sum of particles that originate from distinct factors. The objective of this work is to create a general family of models to learn 1) the features of global factors shared among all observations and the number and global proportions of these factors; 2) for each observation, the proportion of particles that belong to each factor; and 3) the features of observation-specific (or local) factors for each observation. While the first two objectives are fulfilled by existing models, the final objective, which we will call deconvolution, is unique to our model.

Consider RNA sequencing data. Each observation is a collection of cells that have been sequenced together. Each cell has its own RNA, and the expression levels across thousands of diverse cells are convolved together into a single observation. For a given cell sample, scientists would like to identify the proportions of different types of cells. For instance, a sample of blood cells will contain white blood cells, red blood cells, and platelets. Scientists would also like to identify RNA expression levels characterizing each cell type. Finally, knowing the RNA expression levels by cell type for each sample is useful to determine which cell types are healthy or diseased for that sample.

This same structure exists in data from many disciplines, including political voting among different demographics, fMRI scans of neuron activity, and financial investment across a variety of stocks and investors. For each of these domains, observations live in some high dimensional feature space and are generated from a convolution of their constituent unobserved particles. Modeling local structure has distinct advantages, both in terms of predictive performance and interpretability.

We present generalized nonparametric deconvolution models (NDMs), a family of Bayesian nonparametric models for collections of data, in which each observational unit is comprised of heterogeneous particles. NDMs use the hierarchical Dirichlet process to discover the unknown number of latent factors that characterize a data set, and they model each observation as a weighted average of these latent factors. Unlike existing models, however, NDMs are able to recover the factor-specific fluctuations for each observation. This allows us to deconvolve each observation into its constituent factors and describe how these sample-specific factors deviate from their corresponding global factors. We present scalable variational inference techniques for this family of models and study its performance on data from a variety of domains.
Saadia Qamar, Sana Ikram and Mehak Nadeem. Human Computer Interface using EEG/EOG Signals
Abstract: Nearly 6 million people (Christopher & Dana Reeve Foundation) were found to be living with paralysis in the US alone. It was difficult for them to use the available standard computer interfaces. Existing interfaces such as BCIs make use of pairs of electrodes to record eye movement. The aim of this study was to develop a user interface specifically suited to the quadriplegic community that incorporated only a single dry electrode to acquire raw EEG/EOG signals. In this study, 2000 EEG/EOG signals for eye and facial movement were collected to form a dataset. Features for left, right, up, down and blink eye movements and jaw clenching were extracted and classified using Linear Discriminant Analysis and Adaptive Boosting. The classification was implemented in real time to detect EEG/EOG eye movements from raw signals. These signals were used to control the cursor on the screen and other electrical devices in the environment. The study found that six different types of signals could be distinguished with sufficient accuracy. The left and down eye movement signals were partially intermixed, as were the up and right signals. Voltage thresholds in combination with classifiers were used to overcome this shortcoming. The system was implemented in real time to control the cursor on the screen in vertical and horizontal directions. These signals were also used to switch on electrical devices to control the room environment. The study concludes that the proposed model is suitable to enable the quadriplegic community to use technology independently in an indoor living environment.
Yu-Hui Huang, José Oramas, Tinne Tuytelaars and Luc Van Gool. Do Motion Boundaries Improve Semantic Segmentation?
Abstract: Precise object localization is crucial to many computer vision tasks especially for semantic segmentation. State-of-the-art methods have difficulties predicting pixelwise class labels, particularly for small objects. Some researchers proposed to use object boundaries in a post-processing step to alleviate these problems [3]. Some leveraged motion information from optical flow to improve the segmentation [4]. However, these methods need either more computation time or the tuning of hyper-parameters. To get the best from both worlds, we propose to leverage motion boundaries in a neural network based model to efficiently improve semantic segmentation. Motion boundaries, derived from the abrupt changes in optical flow, can serve as a proxy for object boundaries. They reveal the locations of occlusions and object boundaries at the same time. In this work, we investigate how useful these motion boundaries are in improving semantic segmentation.

As there is no dataset readily available for this task (with annotation for both motion boundaries and pixelwise class labels), we compute the motion boundary maps with a pre-trained detector using the structured random forest from [1]. With these motion boundary maps and the corresponding RGB images, we train a convolutional neural network end-to-end, for the task of semantic segmentation. We choose SegNet [2] as our base model due to its efficiency in both computation time and memory consumption. Moreover, instead of using the motion boundaries as an additional modality in a late fusion post-processing step, we propose to integrate motion boundaries as part of the input thus allowing the network to learn from them. During inference, our proposed model takes about 6 seconds (including CPU-based motion boundary computation) to process an image with a resolution of 320 by 480 pixels.

We conduct experiments on the CamVid dataset, where we compare our proposed method to different motion inputs (see [5] for more details). Experimental results show that our proposed network has learned to incorporate the motion cues and that these improve object localization. Our proposed method is 4% higher in class average accuracy than the original SegNet. At the same time, its performance is superior for small-scale classes like traffic sign (+29%), pedestrian (+11%) and pole (+11%). Compared to the integration of other motion cues, our method achieves comparable class average performance and a superior mean IoU, which is a stricter metric. These results show that motion boundaries can complement RGB data, especially for small-scale classes, when training a network. This suggests that the inclusion of motion boundary maps can indeed improve semantic segmentation.

[1] Weinzaepfel et al., Learning to detect motion boundaries. In CVPR 2015.

[2] Badrinarayanan et al., SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. arXiv:1511.00561, 2015.

[3] Sevilla-Lara et al., Optical Flow with Semantic Segmentation and Localized Layers. In CVPR 2016.

[4] Bertasius et al., Semantic Segmentation with Boundary Neural Fields. In CVPR 2016.

[5] Huang et al., Do Motion Boundaries Improve Semantic Segmentation? In ECCV Workshops 2016.
Yucen Luo. Conditional Generative Moment-Matching Networks
Abstract: Deep generative models (DGMs) are effective at learning representations of complex data and performing inference. DGMs are natural choices for probabilistic reasoning and uncertainty estimation tasks such as image generation and missing data imputation. Among them, the generative adversarial network (GAN) [1] is built upon game theory, where two adversarial networks compete. Although it has succeeded in solving various tasks, the training of the min-max formalism is usually unstable. The recently developed generative moment matching networks (GMMN) [2, 3] adopt maximum mean discrepancy (MMD) [4] as the learning objective. MMD measures the distance between two distributions by mapping them into elements of a Hilbert space via kernel embedding. Thus GMMN obtains a simple and efficient objective function and outputs the target distribution with samples from a simple distribution as input.
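The MMD objective that GMMN optimizes can be illustrated with a small unbiased estimator for one-dimensional samples (a sketch only; the Gaussian kernel and fixed bandwidth are illustrative choices, and real implementations operate on minibatches of high-dimensional network samples):

```python
import math

def gaussian_kernel(a, b, bandwidth=1.0):
    # k(a, b) = exp(-(a - b)^2 / (2 * bandwidth^2))
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of the squared MMD between two 1-D samples:
    mean within-sample kernel similarities minus twice the mean
    cross-sample similarity."""
    m, n = len(xs), len(ys)
    k_xx = sum(gaussian_kernel(xs[i], xs[j], bandwidth)
               for i in range(m) for j in range(m) if i != j)
    k_yy = sum(gaussian_kernel(ys[i], ys[j], bandwidth)
               for i in range(n) for j in range(n) if i != j)
    k_xy = sum(gaussian_kernel(a, b, bandwidth) for a in xs for b in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)
```

Because the estimator is differentiable in the generated samples, it can serve directly as a training loss without an adversarial min-max game.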

GMMN only gives an estimation of the joint distribution, however, we could be more interested in conditional distributions in many situations. For example, in predictive modelling of data, a conditional model p(y|x) is more direct than the joint distribution p(x, y), often leading to better performance with fewer training examples. Another case is contextual generation, where some context such as class labels, visual attributes and other information could be conditioned on for cross-modal generation. Moreover, conditional distributions are essential building blocks of large generative probabilistic models, e.g., one relevant work [5] stacks multi-scale conditional generative models to generate natural images.

In this work, we develop conditional generative moment-matching networks (CGMMN), which significantly extends GMMN to learn a conditional distribution given input variables based on a conditional maximum mean discrepancy (CMMD) criterion. Kernel embedding methods map a conditional distribution to operators in Hilbert space. For distribution p(X|Y), its corresponding operator can work on any y and get distribution p(X|y). The CMMD criterion measures the distance of two operators. We design a deep generative architecture using CMMD as the training objective. The learning is performed by stochastic gradient descent with the gradient being calculated by back-propagation.

We evaluate CGMMN on a wide range of tasks, including predictive modelling, contextual generation, and Bayesian dark knowledge [6], which distills knowledge from a Bayesian model by learning a relatively small CGMMN student network. Our results demonstrate competitive performance in all the tasks while holding a neat construction.

Note: This work will be presented at NIPS 2016 as a full research paper.

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. NIPS, 2014.
[2] G. Dziugaite, D. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. UAI, 2015.
[3] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. ICML, 2015.
[4] A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. Smola. A kernel two-sample test. JMLR, 2008.
[5] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. NIPS, 2015
[6] A. Korattikara, V. Rathod, K. Murphy, and M. Welling. Bayesian dark knowledge. NIPS, 2015.
Kathryn Zimmerman and David Field. Visualizing the basis of classification in deep neural networks
Abstract: Deep Neural Networks (DNNs) are widely regarded as the state-of-the-art method in complex object recognition and image classification tasks. However, the reasons that DNNs arrive at a particular solution are often quite difficult to decipher. To address this problem, a number of researchers have been exploring visualization techniques as a tool for understanding network decisions. The motivation behind this line of inquiry is to quantify what cues, in a previously unseen data sample, cause a DNN to arrive at a specific classification decision. Recently, the focus has been on computing how important each pixel is with respect to the classification decision, which can then be represented as a heatmap. The heatmaps produced by these methods are often ambiguous and can be misleading when used as a tool to explain what in an input image caused a network's decision. Previous work has explored which pixels led to specific classifications, but it has failed to address how the network interpreted these as 'important' pixels. Using the pre-trained AlexNet architecture, we probed the causal relationship between individual pixels and the network's classification decision. We present a heatmap-derived visualization algorithm that helps us understand how the network interprets these important pixels when it arrives at a particular decision. Our study compares visualizations, computed by two different methods, using heatmaps generated by the sensitivity-based approach and Layer-wise Relevance Propagation (LRP). We believe that these visualizations provide constructive insights into how these networks classify successfully, as well as into what causes a network to fail.
Nasim Souly and Roozbeh Zabihollahi. Story of life: story-extraction and summarization of personal photo albums using deep learning
Abstract: The extensive availability of mobile recording devices has made existing collections of image data immense in size and high in resolution. There is a large amount of data on one's device that is never used properly due to lack of time, space or interest. Going through a large set of image data to select memorable images and tag them with meaningful words is a tiresome task; thus, in this project we aim to employ machine learning and computer vision algorithms to summarize a large collection of image data based on the saliency and aesthetic characteristics of the images. Furthermore, we leverage recent achievements in generating text from data using deep learning techniques to tag the images and tell a story about the photo albums.
Given a large number of photos, we begin by segmenting them temporally to find different scenarios. For example, if the photos belong to a trip to a national park, we divide them based on time and locations. Next, for each segment, we prune the images based on visual features such as brightness, contrast (not blurry), colorfulness and other aesthetic features. Also, we use a convolutional neural network trained on the Places dataset to select the photos with highly salient regions and tag the images.
We select the tags with high probability and match them with GPS and date-time data provided by the device to generate meaningful tags for images. In addition, we exploit an RNN language model to generate a dense caption using the tags (object and place names) of selected images for each segment of the album. Thus, we summarize the large collection of photos in a few interesting shots and tag them with meaningful sentences to make them convenient to store and retrieve on a mobile device. We believe this is the first attempt to leverage machine learning techniques to build storytelling albums on mobile devices. We also provide interactive exploration to the user, by which one can improve the selected images.
Mobile devices have limited processing power while having a good sensor set and networking capabilities. Nowadays, they store a lot of images that can be converted to stories. Our mobile app implements the lightweight algorithms (e.g., aesthetic feature detection or shot detection) but uses cloud computing for image tagging and sentence creation. We have chosen iOS as our first base platform since pictures on those devices include location data and dates. Given that these devices are popular these days, we hope to use the gathered data as a database for the next phases of our storytelling algorithms.

[1] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. “Learning Deep Features for Scene Recognition using Places Database.” NIPS 2014.
[2] Datta, Ritendra, et al. "Studying aesthetics in photographic images using a computational approach." ECCV 2006.
[3] Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
Jialian Li. Temporal Conditional Moment Matching Deep Neural Networks
Abstract: Temporal data is pervasive in the real world, such as text, video and audio. In contrast to common i.i.d. non-temporal data, temporal data is usually highly correlated and has more complex relationships through time. Thus, taking time into consideration makes learning relatively more difficult than for static data. Aided by recent advances in deep neural networks, the recurrent neural network (RNN) [1] appears to play a dominant role in temporal data related tasks. RNNs use internal hidden states to capture the dependency of data through time. Our previous work developed a novel deep generative model for conditional distributions, named the conditional generative moment matching network (CGMMN) [2], which was accepted at NIPS 2016 as a full research paper. CGMMN is a flexible framework for representing conditional distributions. Since time series modeling can be viewed as conditional modeling, we naturally extend CGMMN to time series. We propose the temporal conditional moment matching deep neural network (TCMMD), a new approach to modeling temporal data.

CGMMN samples data from a conditional distribution, i.e., for a distribution P(Y|X), we can sample y for any given x. The deep generative model is trained with an objective function describing the distance between two conditional distributions using kernel methods. Here we assume that the data at time t depends on the data at previous times; without loss of generality, we condition on the n time steps before t. A simple input distribution is then mapped to the output time-dependent distribution through a parametric neural network. In fact, more connections among time steps could be involved in our model according to different tasks, which makes TCMMD a flexible model that still retains a simple structure. Moreover, expansion along time is not needed in TCMMD, so our model is lightweight and avoids the vanishing gradient problem.

Based on the construction of TCMMD, we demonstrate the effectiveness of our model on a dataset of human motion [3], where it performs well at modeling human walking sequences. After training, the model captures the internal structure of the data and can easily sample a walking sequence of human skeletons, using only 3 consecutive frames from the training data as the initial state. TCMMD is theoretically grounded in the conditional maximum mean discrepancy and effective in the human motion experiments. In a nutshell, our model can capture complex time dependence and generate plausible temporal data while being neater in structure and easier to train than an RNN.

[1] Schmidhuber J. Learning complex, extended sequences using the principle of history compression. Neural Computation, 1992, 4(2):234–242
[2] Ren Y, Li J, Luo Y, et al. Conditional generative moment-matching networks. NIPS 2016
[3] Hsu E, Pulli K, Popović J. Style translation for human motion. ACM Transactions on Graphics (TOG), volume 24. ACM, 2005. 1082–1089
Luisa M Zintgraf, Taco S Cohen, Tameem Adel and Max Welling. Visualizing Deep Neural Network Decisions
Abstract: Over the last few years, deep neural networks (DNNs) have emerged as the method of choice for perceptual tasks such as speech recognition and image classification. In essence, a DNN is a highly complex non-linear function, which makes it hard to understand how a particular classification comes about. This lack of transparency is a significant impediment to the adoption of deep learning in areas of industry, government and healthcare where the cost of errors is high.

This work presents a probabilistically sound methodology for explaining classification decisions made by deep neural networks. The method can be used to produce a saliency map for each (instance, node) pair that highlights the parts (i.e., features) of the input instance that contributed most evidence for or against the activation of the given node (which could be an internal or output node).

The presented method is based on an instance-specific method by Robnik-Šikonja and Kononenko*: to estimate an individual input feature's relevance, the authors simulate the absence of that feature and observe the effect this has on the classifier's prediction. E.g., if the predicted class score is lower without the inspected feature, it must carry some evidence for that class. This way, each input feature is attributed a relevance value reflecting its relative importance in the decision.
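The prediction-difference idea can be sketched as follows (a simplified stand-in, not the method of Robnik-Šikonja and Kononenko or of this paper: here a feature's absence is crudely simulated by substituting a fixed baseline value, rather than by marginalizing over the feature or by the conditional sampling proposed below):

```python
def prediction_difference(classify, x, baseline, target_class):
    """Relevance of each feature of instance x for `target_class`:
    the drop in the predicted class score when that feature is replaced
    by a 'neutral' baseline value. `classify` maps an instance to a
    list of class scores."""
    base_score = classify(x)[target_class]
    relevances = []
    for i in range(len(x)):
        x_removed = list(x)
        x_removed[i] = baseline[i]  # simulate the feature's absence
        relevances.append(base_score - classify(x_removed)[target_class])
    return relevances
```

A positive relevance means the feature carried evidence for the class; a negative one means it spoke against it.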

Our main contributions are three substantial improvements of the method of Robnik-Šikonja and Kononenko: conditional sampling, multivariate analysis, and deep visualization.
Conditional sampling allows us to make a much better approximation to simulating the absence of a feature which leads to a more accurate description of relative feature importance.
Further, instead of testing only single features, we utilize a multivariate method where several features (for images, small pixel patches) are removed together. This is especially effective with high-dimensional data like images, where single features alone do not have a significant impact compared to small groups of features.
Lastly, deep visualization allows us to not only analyze the input-output relationship, but also inspect the role of the inner workings of a deep network, i.e., look at what individual neurons are doing.

We provide several demonstrations of the technique on deep convolutional neural networks (DCNNs) trained on ImageNet data, as well as MRI brain scans of HIV patients with neurodegenerative disease. Our findings from these experiments include, for example, that when differentiating between elephant types (African el., tusker, Indian el.), the DCNN focuses most of its attention on the ears of the elephant, which coincides with human understanding of the class differences. We were also able to illustrate what different feature maps in the DCNN are specialized on, e.g., that some concentrate entirely on the background, and others on the eyes of animals, etc. For the MRI data, we were able to point out which brain regions support the classifier's decision, and that the results were slightly different across individuals, but still to some extent consistent per group.

* Robnik-Šikonja, Marko, and Igor Kononenko. "Explaining classifications for individual instances." IEEE Transactions on Knowledge and Data Engineering (2008): 589-600.
Emily Denton. Semi-Supervised Learning with Context-Conditional Generative Adversarial Networks
Abstract: Deep neural networks have yielded dramatic performance gains in recent years on a variety of tasks. These successes are heavily dependent on large training sets of manually annotated data. In many settings however, such large collections of labels may not be readily available, motivating the need for methods that can learn from data where labels are rare.

We propose a method for harnessing unlabeled image data based on image in-painting. A generative model is trained to predict pixels within a missing hole, based on the context provided by surrounding parts of the image. These in-painted images are then used in an adversarial setting [1] to train a large discriminator model whose task is to determine if the image was real (from the unlabeled training set) or fake (an in-painted image). The realistic looking fake examples provided by the generative model cause the discriminator to learn features that generalize to the related task of classifying objects. Thus, this task acts as a regularizer for standard supervised training of the discriminator.

Recently, [2] showed that discriminator networks, trained in a generative adversarial network (GAN) framework, learn a hierarchical image representation that is useful for object classification. Such results suggest that combining an unsupervised GAN objective with a supervised classification objective would produce a simple and effective semi-supervised learning method. This is confirmed in our experimental results as well as in independently developed work [3]. However, GAN samples are most coherent when the set of images the network is trained on comes from a limited domain (e.g., churches or faces). Additionally, it is difficult to train GANs on very large images. Both these issues suggest semi-supervised learning with vanilla GANs may not scale well to datasets of large diverse images. Rather than determining if a full image is real or fake, our proposed context conditional GAN addresses a different task: determining if a part of an image is real or fake given the surrounding context. We show that this alternative discriminative task produces superior results compared to a vanilla GAN discriminator.
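The in-painting setup can be illustrated by its masking step (a toy sketch on a 2D grayscale grid; the hole position, size, and fill value are illustrative, not the paper's configuration):

```python
def cut_hole(image, top, left, size, fill=0.0):
    """Return a copy of a 2D grayscale image with a size-by-size hole
    replaced by `fill`, plus a binary mask marking the hole. In the
    in-painting setup, the generator predicts the pixels under the mask
    from the surrounding context, and the discriminator judges the
    re-composited image against real unlabeled images."""
    masked = [row[:] for row in image]
    mask = [[0] * len(image[0]) for _ in image]
    for r in range(top, top + size):
        for c in range(left, left + size):
            masked[r][c] = fill
            mask[r][c] = 1
    return masked, mask
```

Because the discriminator must judge only the in-painted region in context, the adversarial task stays tractable even on large, diverse images.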

Using our approach we are able to train large VGG-style discriminator networks in a semi-supervised fashion. We compare against other semi-supervised approaches, and in particular compare against other methods that utilize spatial context as a supervisory signal. We evaluate on STL-10 and PASCAL datasets, where our approach obtains performance superior to existing methods.

In summary, we present a simple semi-supervised learning framework based on in-painting with an adversarial loss. The generator in our context conditional GAN model is capable of producing semantically meaningful in-paintings and the discriminator performs better than existing semi-supervised methods on two classification benchmarks.

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. NIPS 2014.

[2] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR 2016.
[3] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford and X. Chen. Improved Techniques for Training GANs. NIPS 2016.
Sara Taylor, Natasha Jaques, Ehimwenma Nosakhare, Akane Sano and Rosalind Picard. Multi-task Learning for Predicting Stress, Happiness, and Health
Abstract: We apply three Multitask Learning (MTL) techniques to the problem of predicting stress, happiness, and health: 1) A Hierarchical Bayesian method in which tasks share a common Dirichlet process prior (HBDPP); 2) A deep neural network in which several hidden layers are shared among tasks, but each task has unique final layers; and 3) Multitask Multi-Kernel learning (MTMKL), in which tasks share kernel weights on feature types. Data are collected from 24/7 monitoring of smartphone logs and wrist-worn physiological sensors from 187 participants of a 30-day study. These techniques were applied in two ways: 1) each wellbeing label (stress, happiness, and health) as a task, and 2) each user as a task. We achieve state-of-the-art results on our dataset, and demonstrate that this performance improvement is due to using MTL to account for individual differences in the relationship between behavior and mood --- an insight previously unknown in this area.

Perceived wellbeing as measured by self-reported stress, happiness, and health has a number of important clinical health consequences, including being indicative of clinical depression measures [1] and all-cause mortality [2]. Stress has been proven to increase susceptibility to infection and illness [3], and happiness is so strongly associated with greater longevity that the effect size is comparable to that of cigarette smoking [4]. Therefore it is clear that the ability to model and predict self-reported wellbeing could have an extremely beneficial impact on the treatment and prevention of both disease and mental illness.

Unfortunately, modeling wellbeing has historically been a difficult task, with typical classification accuracies ranging from 55-75%, even with sophisticated models or multi-modal data. Further, almost all previous work relates to mood estimation rather than prediction, and is often based on data gathered in the lab.

We advance the state of the art in this domain with four major contributions: 1) We obtain the best performance to date on our dataset; 2) We apply a number of novel techniques that have not previously been used for mood estimation; 3) Our data is gathered as participants go about their daily lives, rather than in a lab; and 4) We are able to predict future wellbeing, as opposed to simply estimating current wellbeing. Further, we provide novel clinical insights, e.g. through the implicit clustering of users provided by HBDPP, and weighting of input sources provided by MTMKL. We also demonstrate that through the use of several regularization techniques, good performance can be achieved even with a small dataset. These innovative ideas can greatly improve future modeling of wellbeing.

[1] Cheng and Furnham. Personality, self-esteem, and demographic predictions of happiness and depression. Personality and individual differences, 2003.

[2] Keller et al. Does the perception that stress affects health matter? The association with health and mortality. Health Psychology, 2012.

[3] Cohen, Tyrrell, and Smith. Psychological stress and susceptibility to the common cold. New England Journal of Medicine, 1991.

[4] Veenhoven. Healthy happiness: Effects of happiness on physical health and the consequences for preventive health care. Journal of happiness studies, 2008.
Timnit Gebru, Jonathan Krause, Yilun Wang, Duyun Chen, Jia Deng and Fei-Fei Li. Visual Census: Using Cars to Study People and Society
Abstract: Detecting a large number of BMWs in images informs us that those images may be of a wealthy area. Conversely, knowing that our images were obtained from a wealthy neighborhood increases the likelihood of detecting expensive cars. We explore this relationship between demographic factors and fine-grained classes by performing large-scale detection of over 2600 car classes and conducting a social analysis of unprecedented scale in computer vision. Using 50 million images from 200 of the biggest cities in the United States, we predict demographic factors such as neighborhood wealth, education levels, voting patterns, crime rates, carbon footprint, and income segregation, and show that our results correlate well with census data. To facilitate our work, we have collected the largest and most challenging fine-grained dataset reported to date, consisting of over 2600 classes of cars comprised of images from Google Street View and other web sources, classified by car experts to account for even the most subtle of visual differences.
Michal Moshkovitz and Naftali Tishby. Information-Based Exploration for Reinforcement Learning
Abstract: The exploration versus exploitation trade-off is a well-known unresolved conflict in reinforcement learning: should an agent select an action that is known to be beneficial, or an unfamiliar action that could provide more valuable information? Many algorithms (e.g., Brafman and Tennenholtz (2003); Strehl and Littman (2005, 2008); Kolter and Ng (2009); Lopes et al. (2012)) use the exploration bonus method to handle this conflict. In this method the agent assigns a bonus to actions and states that were not previously selected much. Each of the aforementioned algorithms uses a different bonus function, but all of them define the bonus as some (decreasing) function of the number of times an action was previously selected.

The exploration bonus method has been shown to be useful both theoretically and empirically. However, there are cases where following this method, as applied so far, leads to undesirable results. For example, consider the following extreme case: suppose the agent can select between two actions a1 and a2. Its observations so far are that a1 always led to the same state while a2 always led to a new state. In this case the agent can reasonably presume that a1 is much more deterministic than a2, and thus, for efficient exploration, the bonus assigned to a2 should be larger than the bonus assigned to a1. Therefore, efficient exploration cannot be achieved if the bonus is only a function of the number of times an action was previously selected. Generally, this problem occurs whenever the actions have varied degrees of “randomness”.

We propose a bonus function that depends on the states that the agent reached so far, unlike previous works. Inspired by a line of work that combines information theory and control (Kappen (2005); Todorov (2006); Rubin et al. (2012)), we focus on the notions of predictive information and empowerment to quantify the predictability of our actions and the capacity of future options (see Bialek et al. (2001); Tishby and Polani (2011)). Plugging the logarithm of the past empirical transitions, which approximates both the predictive information and the empowerment, as an exploration bonus, yields a new algorithm which we call Information-Based Exploration (IBE). We then show an illustrative scenario where IBE outperforms other exploration algorithms (R-max, Bayesian Exploration Bonus, epsilon-greedy, and Model-based Interval Estimation with Exploration Bonus). Also, we demonstrate that when the degrees of randomness of the transition probabilities are similar to one another, IBE’s performance is not much different than the other algorithms.
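The extreme case above can be made concrete. The sketch below is illustrative only (the paper's IBE bonus is the logarithm of the past empirical transitions, not this exact formula); it contrasts a count-only bonus, which cannot separate the two actions, with a successor-dependent one, which can:

```python
import math
from collections import defaultdict

# Empirical transition counts: counts[(s, a)][s_next] = times s_next followed (s, a).
counts = defaultdict(lambda: defaultdict(int))

def update(s, a, s_next):
    counts[(s, a)][s_next] += 1

def count_bonus(s, a):
    """Classical bonus: a decreasing function of the visit count of (s, a) alone."""
    n = sum(counts[(s, a)].values())
    return 1.0 / math.sqrt(n) if n else 1.0

def info_bonus(s, a):
    """State-dependent bonus in the spirit of IBE: grows with the number of
    distinct observed successors, so a 'random' action keeps a larger bonus
    than a near-deterministic one with the same visit count."""
    return math.log(1 + len(counts[(s, a)]))

# a1 always leads to the same state; a2 always leads to a new state.
for t in range(5):
    update('s0', 'a1', 's1')
    update('s0', 'a2', f's{t + 2}')

assert count_bonus('s0', 'a1') == count_bonus('s0', 'a2')  # counts identical
assert info_bonus('s0', 'a2') > info_bonus('s0', 'a1')     # successors differ
```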

While information theory was already used in reinforcement learning (e.g., Tishby and Polani (2011); Rubin et al. (2012); Todorov (2006); Kappen (2005); Van Hoof et al. (2015); Peters et al. (2010); Schulman et al. (2015); Still and Precup (2012)), our results are fundamentally different, as we focus on the empirical state transition function rather than the policy, which can be deterministic in our case.
Ying Yang, Robert Kass, Michael Tarr and Elissa Aminoff. Understanding neural dynamics of human vision using convolutional neural networks
Abstract: The human brain can recognize a visual scene or object in less than a second. Previous research supports the hypothesis that the visual cortex is organized in a hierarchical manner: from posterior to anterior parts of the brain, cortical regions extract different levels of information, ranging from low-level local edges to high-level semantic properties (e.g. animate vs inanimate objects, indoor vs outdoor scenes). However, the detailed spatio-temporal dynamics are still poorly understood.
In parallel, inspired by the feedforward direction of this hypothesis, convolutional neural networks (CNNs) have recently achieved good performance in object and scene recognition. Moreover, the “activations” of units in different layers of CNNs also provide operational definitions of low-level to high-level features, which have been shown to explain neural activity in the visual cortex well [1].
Here we used regression analysis to study correlations between spatio-temporal neural responses and features extracted from a CNN, aiming to characterize what type of features are encoded at different time stages and spatial locations in the visual cortex. Using magnetoencephalography (MEG), featuring high temporal resolution and intermediate spatial resolution, we recorded neural activity while human participants viewed briefly presented (200 ms) photographs of scenes. We also extracted activity of units (or features) in a pre-trained 8-layer CNN ("AlexNet" [2]) in response to the presented photographs. In the regression analysis, we found most layers explained a significant amount of variance of the neural activity in 60–450 ms after the stimulus onset.
Then we focused on Layers 1 and 7, two representatives of low-level and high-level features. Noticing that the two layers might be linearly dependent, we ran a canonical correlation analysis to obtain the most correlated components between Layers 1 and 7, which formed a low-dimensional common space for both layers, representing low-level features that were correlated with high-level features. We also obtained principal components of the same dimensions of Layers 1 and 7 orthogonal to the common space, and termed them residual Layer 1 (representing low-level features less correlated with high-level features) and residual Layer 7 (representing high-level features less correlated with low-level features).
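A minimal sketch of this common-space/residual construction, on synthetic data with a planted shared signal (the dimensions are hypothetical, not the actual MEG or AlexNet feature sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Hypothetical per-image feature matrices for two CNN layers (rows = images).
shared = rng.normal(size=(n, 2))                     # latent shared signal
L1 = np.hstack([shared, rng.normal(size=(n, 4))])    # "Layer 1" features
L7 = np.hstack([shared @ rng.normal(size=(2, 2)), rng.normal(size=(n, 3))])

X = L1 - L1.mean(axis=0)
Y = L7 - L7.mean(axis=0)

# CCA via QR + SVD: singular values of Qx^T Qy are the canonical correlations.
Qx, _ = np.linalg.qr(X)
Qy, _ = np.linalg.qr(Y)
U, corrs, Vt = np.linalg.svd(Qx.T @ Qy)

k = 2  # dimension of the common space
common = Qx @ U[:, :k]            # low-level features correlated with high-level ones
# Residual Layer 1: the part of X orthogonal to the common space.
X_resid = X - common @ (common.T @ X)

assert corrs[0] > 0.9                               # planted shared signal recovered
assert np.allclose(common.T @ X_resid, 0, atol=1e-8)  # residual is orthogonal
```

In the actual analysis, regression of the MEG source activity on `common`, `X_resid`, and the analogous residual of Layer 7 gives the variance-explained comparisons reported below.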
We observed that residual Layer 1 explained more variance than residual Layer 7 in the early visual cortex (EVC) in the posterior part of the brain within the first 100 ms, whereas residual Layer 7 explained more variance than residual Layer 1 in more anterior visual cortex in 150–400 ms. This early-to-late, posterior-to-anterior, lower-to-higher-level pattern is consistent with the feedforward processing in the hierarchical hypothesis. Interestingly, we also found that in the EVC at 350 ms (150 ms after the stimulus offset), the variance explained by the common space was greater than that explained by residual Layer 1, suggesting that the EVC could differentiate low-level features that were correlated with high-level features from those less correlated with high-level features, possibly due to some feedback processing.
[1] Khaligh-Razavi, S. M., & Kriegeskorte, N. (2014). PLoS Comput Biol, 10(11), e1003915.
[2] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). NIPS (pp. 1097-1105).
Clarissa Valim, Rushdy Ahmad, Miguel Lanaspa, Yan Tan, Sozinho Acacio, Michael Gillette, Katherine Almendinger, Danny Milner, Lola Madrid, Karell Pelle, Jaroslaw Harezlak, Jacob Silterra, Pedro Alonso, Steven Carr, Jill Mesirov, Dyann Wirth, Roger Wiegand and Quique Bassat. Biomarker signatures based on blood proteins distinguish bacterial, virus, and malaria etiology of pediatric clinical pneumonia
Abstract: Plasma-detectable biomarkers that rapidly and accurately diagnose bacterial infections in children with suspected pneumonia could reduce the morbidity of respiratory disease and decrease the unnecessary use of antibiotic therapy. We sought to identify protein combinations that could discriminate bacterial from viral or malarial diagnoses through measurement of 56 markers in a multiplexed immunoassay analyzed through classification trees, support vector machines (SVM), and regression. We selected 80 patients with clinically diagnosed pneumonia (as defined by the World Health Organization) who also met criteria for bacterial (N = 23), viral (N = 30) or malarial (N = 27) infections based on clinical, radiographic, and laboratory results. Ten healthy community controls were enrolled to assess marker reliability. Patients were subdivided into two sets: one for identifying potential markers (2/3) and another for validating them (1/3). The optimal signatures were identified in the training set through comparison of all three groups and were validated by comparing patients with and without a bacterial infection. Three proteins (haptoglobin, TNFR2 or IL-10, and TIMP-1) were identified that, when combined through a classification tree signature (penalizing misclassification of bacterial cases), accurately classified patients into bacterial, malarial and viral etiologies, and misclassified only one patient with bacterial pneumonia from the validation set. The overall sensitivity and specificity of this signature for the bacterial diagnosis were 96% and 86%, respectively. Alternative combinations of markers with comparable accuracy were selected by support vector machine (using recursive feature elimination) and regression models (using elastic net) and included haptoglobin, IL-10, and CK-MB.
In the regression models, the cross-validation error curves from 150 replications were averaged into smooth curves to help identify tuning parameter values associated with the minimum cross-validated deviance. Although the accuracy of signatures selected through SVM and regression was comparable, SVM selected a more parsimonious signature. Combinations of plasma proteins that are nonspecific on their own accurately identified children with a respiratory syndrome who were likely to have bacterial infections and who would benefit from antibiotic therapy. When used in conjunction with malaria diagnostic tests, they may improve diagnostic specificity and simplify treatment decisions for clinicians. Several current point-of-care multiplex technologies could be used to implement the identified signature and improve patient care. This work was published in the American Journal of Respiratory and Critical Care Medicine in 2015. Future developments will seek to validate the identified biomarker signature with a larger number of patients and at different study sites.
Marcela Mendoza, Jeffrey Mckinstry, Arnon Amir and Dharmendra Modha. Real-Time Gesture Localization and Classification Using Cognitive-Inspired Hardware
Abstract: There is a demand for mobile and embedded-system user interfaces to interact in a natural and intuitive manner and perform increasingly “cognitive” functions to infer a user’s intent. These increased cognitive demands generate the need for efficient energy consumption and novel computing paradigms for real-time systems.

Recent work on neuromorphic computing has demonstrated hardware with novel architectural primitives able to perform cognitive functions for extremely low power. We have built a real-time, neuromorphic, low-power system for gesture localization and classification. This system mimics biological visual perception and recognition on neuromorphic substrates by localizing and correctly classifying gestures in real time for user-interface applications.

In this system, a Dynamic Vision Sensor (DVS), a retina-inspired camera that detects changes in the intensity of the scene, collects event-based gesture data from a user. This data is processed in real time using the IBM TrueNorth chip – a one-million-neuron spiking-neural-network chip that typically consumes 100 mW. Gesture classification and localization are implemented using two energy-efficient neuromorphic convolutional neural networks (CNNs) running on one TrueNorth chip. Our group has previously demonstrated that these energy-efficient networks approach state-of-the-art classification accuracy despite low-precision weights and binary activation functions. Gesture classification is achieved by training the first CNN with event-based gesture data from the DVS. Localization is implemented by training a second CNN on the presence/absence of a gesture of interest; the learned CNN is then expanded to operate on a larger image. The output of the resulting trained network is a heatmap of the image regions where the gesture of interest is likely to appear. Both CNNs run at 1000 classifications per second and are coupled within a real-time system for a natural user-interface experience. To our knowledge this is the first time that gesture classification and localization have been performed using only neuromorphic substrates.
Putu Ayu Sudyanti, Vinayak Rao and Hyonho Chun. Estimating Contamination Rates and Detection of Somatic Mutations in Paired Tumor and Normal NGS Data via Bayesian Nonparametrics
Abstract: Identification and classification of cancer subtypes has become increasingly quantitative with the advent of next generation sequencing technologies. Somatic mutations, i.e. changes to the genome that are specific to tumor cells, have been shown to play an important role in cancer development, and thus their identification is a necessary step in understanding cancer cell progression. Because comparing mutation rates between cancer and normal tissues involves unknown biological contamination between samples, any modeling technique must account for this issue. We propose a hierarchical Bayesian nonparametric model to predict somatic mutations while allowing for base-pair-level specific contamination rates.

Next generation sequencing (NGS) is characterized by aligning snippets of DNA, i.e. short sequences of the bases A, T, C, G known as reads, to a reference genome. Because each read may or may not contain a mutation at a given location, the variant count can be modeled as a Binomial B(n, p) random variable, where n is the total number of reads. Comparing tumor and normal tissue samples from the same subject controls for natural genetic variation and allows differentiation between the two types of mutations. The nature of NGS, where batches of cells are sequenced simultaneously, however, results in samples with an unknown degree of contamination. Because the tumor sample may contain fractions of normal cells, which affect downstream analysis, it is essential to estimate the contamination rate of the sequencing results while simultaneously modeling somatic mutations along the genome.

We propose a Dirichlet process mixture model for data consisting of variant counts and read depths at different locations of the genome across 22 chromosomes. We model the contamination rates in this data as following a Dirichlet process prior. There are two extremes at which the contamination rates can be modeled: independent rates at each site of the genome, or uniform rates across all sites of the genome. We extend this model to a more flexible framework by allowing contamination rates to cluster at shared values. We develop a Markov chain Monte Carlo algorithm to sample from the joint posterior distribution of the contamination rates and estimate the joint genotypes at each location. Estimation of the joint genotypes allows for inferences on the somatic mutations. We test our model on both synthetic and real data for accuracy.
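The clustering behaviour of the Dirichlet process prior, sitting between the two extremes of fully independent and fully shared rates, can be illustrated with a Chinese restaurant process draw (a toy sketch of the prior, not the authors' sampler):

```python
import random

random.seed(0)

def crp_partition(n_sites, alpha=1.0):
    """Chinese-restaurant-process draw: each site joins an existing cluster of
    contamination rates with probability proportional to its size, or opens a
    new cluster with probability proportional to alpha."""
    clusters = []  # cluster sizes
    labels = []
    for i in range(n_sites):
        r = random.uniform(0, i + alpha)
        acc, choice = 0.0, len(clusters)
        for j, w in enumerate(clusters + [alpha]):
            acc += w
            if r < acc:
                choice = j
                break
        if choice == len(clusters):
            clusters.append(1)   # open a new cluster (a new shared rate)
        else:
            clusters[choice] += 1
        labels.append(choice)
    return labels, clusters

labels, clusters = crp_partition(1000, alpha=2.0)
# The DP interpolates between the extremes in the text: far fewer distinct
# rates than sites, but more than a single uniform rate.
assert 1 < len(clusters) < 1000
assert sum(clusters) == 1000
```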
Jennifer Wei, David Duvenaud and Alan Aspuru-Guzik. Convolutional networks for organic chemistry reaction prediction
Abstract: Designing an efficient synthetic pathway for making an organic molecule is a challenging process even for organic chemists with years of lab experience. In creating a synthetic pathway, one must simultaneously minimize the number of steps in the pathway, the cost of starting materials, and the waste generated by each reaction. There are approximately 500 different reaction classes [1], and an organic chemist must be familiar with the applicability and limitations of each reaction when constructing a synthetic route for her molecule. To aid in synthetic design, it is desirable to develop a machine learning algorithm that, like humans, could develop intuition for the available reactions and their application from previously seen examples.

In order to design a synthetic pathway, the algorithm must first be able to predict the results of individual reaction steps and string them together into a pathway. In this work, we developed a two-part algorithm that uses neural networks to predict the products of a reaction given only the starting molecules as input. The first part of the algorithm uses a neural network to predict the reaction type from features derived from the molecular graph of the reactants, created by treating the atoms of the reactants as nodes and the bonds of the reactants as edges. The second part of the algorithm uses a recurrent neural network to predict the SMILES (Simplified Molecular-Input Line-Entry System) representation using both the features of the reactants and the predicted reaction type as inputs.
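The first part of the pipeline, predicting a reaction type from reactant features, can be sketched as a softmax classifier over a fingerprint vector (the feature and class counts here are hypothetical; the paper trains a neural network on features derived from the molecular graph):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reactant fingerprint (e.g. pooled atom/bond features from the
# molecular graph) and a linear softmax head over reaction types.
n_features, n_reaction_types = 32, 16
W = rng.normal(size=(n_features, n_reaction_types)) * 0.1

def predict_reaction_type(fingerprint):
    """Score each reaction class from reactant features; the predicted class
    (together with the features) then conditions the product decoder."""
    logits = fingerprint @ W
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

fp = rng.normal(size=n_features)
cls, probs = predict_reaction_type(fp)
assert 0 <= cls < n_reaction_types
assert abs(probs.sum() - 1.0) < 1e-9
```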

Here, I present the results of this work on two sets of data. The first set contains reactions of alkyl halide and alkene molecules, which are among the first reaction families taught to first-year organic chemistry students. These reactions were generated by applying simple transformation rules to alkyl halide and alkene molecules, which were themselves generated systematically using the cheminformatics package RDKit. The algorithm was able to achieve 80% accuracy on selected textbook problems from the relevant chapter [2]. The second set of reactions is a set of Diels-Alder reactions based on published reactions, pulled from the Chematica network of organic chemistry reactions [3]. This set of reactions is particularly challenging because there are multiple possible reaction sites in the starting molecules, resulting in different possible orientations for the final product that are difficult to categorize generally. The most prevalent orientation is determined by the groups of atoms adjacent to the reactive site. For this section, we use the recurrent neural network to predict the correct product molecule graph using only the reactant molecules as inputs.

Note: The first part of the reaction prediction can be found at http://arxiv.org/abs/1608.06296

1. Sansom, Clare. Chemistry World. Royal Society of Chemistry. 13 August 2015. https://www.chemistryworld.com/feature/exploiting-the-data-mine/8850.article
2. Wei, Jennifer N, Duvenaud, D. and Aspuru-Guzik, A. ACS Central Science. In press. Preprint: arXiv:1608.06296.
3. S. Szymkuć, E. P. Gajewska, T. Klucznik, K. Molga, P. Dittwald, M. Startek, M. Bajczyk, B. A. Grzybowski, Angew. Chem. Int. Ed. 2016, 55, 5904. DOI: 10.1002/anie.201506101
Negar Ghourchian, Michel Allegue and Doina Precup. Semi-supervised Approach for Passive Indoor Localization using Wifi Signals
Abstract: Indoor localization is an essential function of smart homes, enabling the discovery of valuable knowledge about the performance, behaviour and preferences of residents, especially those who need long-term monitoring. We are interested in a recent research area that focuses on developing device-free passive (DFP) technologies for localization, since they do not require users to carry any devices. The key idea behind most DFP systems is to use radio-frequency (RF) sensors, e.g. wifi, to monitor the influence and reflection of human body movements on the strength and pattern of wireless signals within the sensing area.
We introduce a novel solution that solely utilizes off-the-shelf wifi-enabled devices, e.g. access points (APs), laptops and smart TVs, for passive sensing of the environment. In particular, we extract channel state information (CSI) values that describe the channel properties of a transmitter-receiver connection, which offers the capability of estimating the location of a moving entity. We have applied various data mining techniques to statistically model the correlation between the complex propagation of wifi signals and the location of movements. However, the CSI values are extremely sensitive to many parameters, e.g. antenna orientation, dynamic environments and signal interference, which leads to the next major challenge: the underlying structure of the learned model changes through time. This problem can be considered as concept drift, where the distribution of inputs changes over time in unforeseen ways, which strongly affects the accuracy of the learner.
We propose a real-time semi-supervised framework that initially trains a base learner, using a Random Forest classifier, to discern the mapping between signals and different locations from a small amount of labeled data. Random forests are ensemble predictors that provide a final class label as well as class prediction probabilities. In the online testing phase, we take advantage of these probabilities to establish “high confidence intervals” over the stream of unlabeled data, and keep these examples and their associated labels in a repository for reuse when drift occurs. We then construct a KL-divergence-based score that automatically recognizes substantial drifts in the feature space distribution over the streaming data. When a significant drift happens, the algorithm resamples from the recent repository to update the initial training set and automatically retrains the classifier to maintain performance.
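The drift-detection step can be sketched with a histogram-based symmetrised KL score on a single synthetic feature (the feature values, threshold, and distributions below are hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_drift_score(ref, cur, bins=20):
    """Symmetrised, add-one-smoothed KL divergence between the reference
    feature distribution and the current window of streaming data."""
    lo, hi = min(ref.min(), cur.min()), max(ref.max(), cur.max())
    p, _ = np.histogram(ref, bins=bins, range=(lo, hi))
    q, _ = np.histogram(cur, bins=bins, range=(lo, hi))
    p = (p + 1) / (p + 1).sum()
    q = (q + 1) / (q + 1).sum()
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

ref = rng.normal(0.0, 1.0, size=2000)      # CSI-like feature at training time
same = rng.normal(0.0, 1.0, size=2000)     # stable environment
drifted = rng.normal(1.5, 1.2, size=2000)  # e.g. antenna orientation changed

assert kl_drift_score(ref, same) < 0.05    # no drift detected
assert kl_drift_score(ref, drifted) > 0.5  # substantial drift detected

threshold = 0.2
retrain = False
if kl_drift_score(ref, drifted) > threshold:
    retrain = True   # resample high-confidence examples and refit the forest
assert retrain
```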
We evaluated our proposed algorithm on real data collected from 5 different residential apartments, over 30 experiments, using a laptop equipped with a CSI collection tool as receiver and an AP as transmitter to create the wifi connection for room-level localization. In each experiment, the initial training set included captures of CSI while a user walked inside each room for 30 seconds; a few diagnostic (test) sets were then captured with the same setup some time after the initial training to obtain examples of drift. On average, the initial prediction accuracy was 95±3%, over 4-6 classes, which degraded to 87±3%, 83±4% and 72±4% after 60 min, 90 min and 5 hours, respectively. The experimental results show that the proposed adaptive solution significantly resolves the temporal instability issue and improves the degraded accuracies to 92±3%.
Tamara Louie. Near-Real-Time Influenza Prediction using Dynamically Recalibrated AutoRegressive Electronic health record Support Vector Machine (ARES) Models
Abstract: In this research, we present an approach to obtain near-real-time influenza predictions using dynamically recalibrated support vector machine models we call AutoRegressive Electronic health record Support vector machine (ARES).

Accurate prediction of influenza activity is crucial for public health work. Current influenza surveillance systems in the United States provide epidemiologic intelligence that allows health officials to deploy preventive measures and helps clinic and hospital administrators make optimal staffing and stocking decisions. The United States (US) Centers for Disease Control and Prevention (CDC) monitors several influenza metrics, including influenza-like illness (ILI), which only becomes available with a time lag of 1-2 weeks, leaving a need for real-time influenza prediction.

We built a series of dynamically recalibrated ARES models to predict current CDC-defined ILI levels at the national and regional level in the US. We compared these estimates to those produced by two baseline methodologies: (a) dynamically-trained autoregressive AR(2) models that used historical CDC information, and (b) a dynamically-trained linear regression model that used EHR ILI features.
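The dynamic-recalibration idea can be sketched on synthetic data; for simplicity this toy uses a rolling-window ordinary-least-squares AR(2) fit (closer to baseline (a)) rather than the SVM-based ARES model itself, and the series and window length are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic weekly ILI-like series: seasonal cycle plus noise.
t = np.arange(300)
ili = 2 + np.sin(2 * np.pi * t / 52) + 0.1 * rng.normal(size=t.size)

def ar2_forecast(series, window=104):
    """Each week, refit an AR(2) on the most recent `window` points
    (dynamic recalibration) and predict the next value."""
    preds = []
    for i in range(window, series.size):
        y = series[i - window + 2:i]
        X = np.column_stack([
            np.ones(y.size),
            series[i - window + 1:i - 1],   # lag 1
            series[i - window:i - 2],       # lag 2
        ])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        preds.append(beta @ [1.0, series[i - 1], series[i - 2]])
    return np.array(preds)

preds = ar2_forecast(ili)
rmse = np.sqrt(np.mean((preds - ili[104:]) ** 2))
assert rmse < 0.5   # the recalibrated model tracks the seasonal signal
```

ARES augments this autoregressive core with EHR-derived features and replaces the least-squares fit with a support vector machine, retrained each week.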

We evaluated model performance using average Pearson correlation, RMSE, and relative RMSE in each region and nationally, and plotted heatmaps of the ARES model coefficients versus time to understand the predictive power of each feature. Those with the highest predictive power were CDC’s ILI value during the previous week of influenza estimation (t-1), EHR’s viral visit counts during the week of the estimates (t), and EHR’s ILI during the week of the estimates (t).

Results indicated that influenza estimates using ARES led to 2–3 fold error reductions in the national and regional estimates when compared to the dynamic linear regression model, as well as comparable or improved Pearson correlations at the national and regional levels, with the exception of regions 7 and 10 during the 2013-2014 influenza season. When compared to the autoregressive AR(2) model, ARES improved Pearson correlation (from 0.958 to 0.996) and reduced RMSE three-fold at the national level. ARES also correctly estimated the timing and magnitude of the national peak week of the 3 influenza seasons predicted, as well as correctly estimating the timing for at least 2 (out of 3) influenza seasons for 6 out of 10 regions.

This approach provides evidence that EHR data, in combination with historical CDC influenza data and a dynamic machine learning algorithm, can provide reasonable influenza estimates in a more timely manner than the CDC at the national and regional level in the US.

A key limitation of this approach is the lack of EHR data availability to the general public. However, this research could be extended to many applications.  Examples include: longer forecasts of influenza activity; using ARES as input into an ensemble approach of digital signals or models; adding network features; and implementing ARES at the state and city levels, in other countries, as well as for other communicable diseases. 
Bianca Dumitrascu, Roberta de Vito, Christopher Brown and Barbara Engelhardt. A meta tissue nonparametric factor analysis model for gene co-expression under structured and unstructured noise
Abstract: In high-throughput genomic meta-analysis studies two assumptions stand out: a) that studies are independent and therefore uncorrelated, and b) that subjects across studies are nonoverlapping. Violation of either of these assumptions leads to spurious results due to structured biological noise and the difficulty of distinguishing the noise from true signal.

We address these challenges in the context of extracting co-expression gene clusters essential to the identification of patterns across tissues coming from correlated brain regions of overlapping sets of individuals. Our method is a meta-study Bayesian nonparametric generalized latent factor model with three important features: a) an interpretability-enhancing Indian Buffet Process (IBP) prior over a common loading matrix modeling the true underlying gene correlation structure; b) a two-component spike-and-slab prior over study-specific loading matrices accounting for the structured technical noise; and c) a multivariate error term shared across---and accounting for---overlapping subjects. Our computationally efficient algorithm brings together a traditional Gibbs sampler and a fast dynamic posterior exploration EM algorithm (Rockova et al., 2015).

We illustrate the advantages of the proposed method through simulations and an application to a high dimensional biological data set. Specifically, we identify brain specific co-expressed gene clusters within the Genotype-Tissue Expression (GTEx) v6 project using RNA-sequencing of samples from a total of 449 individuals, across 13 different, but highly correlated, brain tissue subtypes.
Shabnam Tafreshi and Mona Diab. Classifying Emotions in Sentence-level Text Using Tree Kernel with Syntactic and Semantic Features in Multi-Genre Corpus
Abstract: Predicting emotion categories at sentence level supports applications such as online chatting, news classification, and situational awareness for both authorities and the public. Prior research has shown the significant role semantic features play in predicting and detecting emotions at sentence level. In this study, we aim to expand the horizon of sentence-level emotion detection by using the semantic aspect of word vector models to craft general and portable features, which can be utilized across different genres and different pieces of text (i.e. sentences and clauses). To fulfill this aim we utilize a multi-genre corpus containing three genres: emotional blog posts, news headlines, and movie reviews. We investigate whether word2vec clusters can provide gains in addition to emotion lexicons, when associating the emotion tag of the lexicon words to word2vec clusters. Further, we use each cluster as a feature to capture the semantic appearance of words related to each emotion.

We carry out the evaluation in a supervised framework, employing Tree Kernel functions and a Support Vector Machine (SVM) to generate different feature spaces based on semantic representations of the input data. We build two models for the classification task: a) a 9-way classification task of classifying emotion into Plutchik's 8 emotions from the set {joy, trust, fear, surprise, sadness, disgust, anger, anticipation} plus the label no-emotion, for which we experiment with a Tree Kernel based model; and b) a baseline that uses corpus features (uni-gram, bi-gram, part of speech) and emotion lexicon features with the emotion labels above. We evaluate the impact of our semantic features in a supervised platform using SVM. Our experiments show that both the Tree Kernel and SVM models benefit from our features, within each genre of the corpus and when we use the full multi-genre corpus.
Coline Devin, Abhishek Gupta, Sergey Levine, Trevor Darrell and Pieter Abbeel. Modular Networks for Multi-Task Multi-Robot Transfer
Abstract: Reinforcement learning (RL) can automate a wide variety of robotic skills, but learning each new skill requires considerable real-world data collection and manual representation engineering to design policy classes or features. Using deep reinforcement learning to train general purpose neural network policies alleviates some of the burden of manual representation engineering by using expressive policy classes, but exacerbates the challenge of data collection, since such methods tend to be less efficient than RL with low-dimensional, hand-designed representations.
Transfer learning can mitigate this problem by enabling us to transfer information from one skill to another and even from one robot to another.
We show that neural network policies can be decomposed into "task-specific" and "robot-specific" modules, where the task-specific modules are shared across robots, and the robot-specific modules are shared across all tasks on that robot.
This allows for sharing task information, such as perception, between robots and sharing robot information, such as dynamics and kinematics, between tasks. We exploit this decomposition to train mix-and-match modules that can solve new robot-task combinations that were not seen during training. Using a novel neural network architecture, we demonstrate the effectiveness of our transfer method for enabling zero-shot generalization with a variety of robots and tasks in simulation for both visual and non-visual tasks.
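A toy sketch of the mix-and-match decomposition (hypothetical dimensions and single linear modules; the paper's modules are deeper networks trained with RL):

```python
import numpy as np

rng = np.random.default_rng(0)

obs_dim, interface_dim = 8, 4
act_dims = {"arm3": 3, "arm4": 4}   # hypothetical robots with 3- and 4-dim actions

# Task-specific modules map observations to a shared interface vector.
task_modules = {
    "reach": rng.normal(size=(obs_dim, interface_dim)) * 0.1,
    "push":  rng.normal(size=(obs_dim, interface_dim)) * 0.1,
}
# Robot-specific modules map the interface vector to that robot's actions.
robot_modules = {r: rng.normal(size=(interface_dim, d)) * 0.1
                 for r, d in act_dims.items()}

def policy(obs, task, robot):
    """Mix-and-match: any task module composes with any robot module,
    including combinations never seen together during training."""
    z = np.tanh(obs @ task_modules[task])    # task-specific (e.g. perception)
    return z @ robot_modules[robot]          # robot-specific (e.g. control)

obs = rng.normal(size=obs_dim)
# A 'zero-shot' pairing: e.g. the push module with the 4-DoF arm.
assert policy(obs, "push", "arm4").shape == (4,)
assert policy(obs, "reach", "arm3").shape == (3,)
```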
Konstantina Palla, François Caron and Yee Whye Teh. Bayesian nonparametrics for Sparse Dynamic Networks
Abstract: We propose a Bayesian nonparametric prior for time-varying networks. We are interested in time series settings, where we observe the evolution of links among objects over time. As an example, links may represent friendships in a social network. Such networks are typically represented by random graphs, where the vertices and edges correspond to objects and links respectively. In the graph setting, sparsity is defined in terms of the rate at which the number of edges grows as the number of nodes increases. In a sparse graph the number of edges grows sub-quadratically in the number of nodes, so that in the limit two nodes chosen at random are very unlikely to be linked. While sparsity is a property found in many real-world network datasets [Newman2009], most of the popular Bayesian models used in network analysis yield dense graphs, i.e. the number of edges grows quadratically in the number of nodes. The fundamental reason for this misspecification is the classical representation of the graph as a random exchangeable array, i.e. the adjacency matrix [Orbanz2015]. Exchangeability in the graph domain has historically been defined as invariance of the distribution under permutations of the node order, i.e. relabelling the nodes does not change the distribution of the graph; this is known as vertex-exchangeability. However, as a corollary of the Aldous-Hoover theorem [Aldous1981, Hoover1979], exchangeable random arrays are dense or empty and thus not appropriate for most real applications. In an attempt to account for sparse graphs, several models have been proposed, but with undesirable properties: they give up either exchangeability [Barabasi1999] or projectivity [Bollobas2007, BollobasRiordan2009, WolfeOlhede2013, BorgsChayesCohnZhao2014]. In their recent work, [Caron2014a] proposed a model that represents graphs as infinite point processes on $\mathbb{R}^2_+$, giving rise to a class of sparse random graphs.
For the associated notion of exchangeability of point processes, Kallenberg [Chapter 9, Kallenberg2005] provides a representation theorem that is the continuous-space counterpart of the Aldous-Hoover theorem. In this work, we are interested in the dynamic domain and aim to model probabilistically the evolution of sparse graphs over time. Towards that end, we build on the model proposed by Caron and Fox and extend it to time series settings.

More specifically, each node of the network is associated with a positive parameter modelling the sociability of that node. Sociabilities are assumed to evolve over time and are distributed according to a time-varying generalised Gamma process. The model is able to (a) capture smooth evolution of the interaction between nodes, allowing edges to appear and disappear over time, (b) capture long-term evolution of the sociabilities, and (c) yield a broader class of dynamic graphs ranging from dense to sparse. We provide some theoretical insights into the model and apply it to three real-world datasets: the Reuters news dataset, Facebook social circles, and Wikipedia article links.
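As a rough illustration of how sociability parameters yield sparse graphs, here is a sketch of the static construction of Caron and Fox with invented sociability values (the full model replaces these with draws from a time-varying generalised Gamma process):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented sociability parameters for 5 nodes at one time step; in the actual
# model these are drawn from a (time-varying) generalised Gamma process.
w = np.array([2.0, 0.5, 0.1, 0.05, 0.01])

n = len(w)
adj = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(i + 1, n):
        # In the Caron-Fox construction an edge appears with probability
        # 1 - exp(-2 * w_i * w_j): pairs of high-sociability nodes link often,
        # pairs of low-sociability nodes almost never do, yielding sparsity.
        p = 1.0 - np.exp(-2.0 * w[i] * w[j])
        adj[i, j] = adj[j, i] = rng.random() < p

print(adj.sum() // 2, "edges")
```

In the dynamic extension described above, the vector `w` itself evolves over time, so edges between the same pair of nodes can appear and disappear across time steps.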
Amal Rannen Triki and Matthew B. Blaschko. Stochastic Function Norm Regularization of Deep Networks
Abstract: Although deep neural networks (DNNs) have dramatically improved image classification systems, they require massive labeled databases to train accurate systems. Indeed, DNNs have a complicated mathematical structure. They are highly non-convex, and classical approaches to training on smaller datasets, such as L2 regularization, cannot be straightforwardly applied. This has precluded their application in important domains where large datasets are not available, such as medical image analysis. This lack of data is due to the high cost of image acquisition, frequently requiring specialized equipment, and the need of expensive, highly qualified experts to annotate the collected samples. Furthermore, many important diseases are rare, meaning that few training examples are available even when image acquisition is inexpensive.
In this work, we propose a new regularization method for DNN training. The state-of-the-art methods are based mainly on weight decay and DropOut. These methods yield impressive performance when a large training set is available, but their performance decreases when applied to only a few hundred data samples. Moreover, they deal with the complexity of the function represented by the neural network only indirectly. Our idea is to study the feasibility of directly regularizing with the L2 function norm.
The function norm is not easily computable, mainly because of the high dimensionality of the problem. Thus, we consider using weighted norms. Rather than integrating the network function with respect to the Lebesgue measure, we propose to use two probability measures. The first measure is the data distribution. To obtain an unbiased estimate of this weighted norm, we operate in two steps: (i) at each step of training, the network is applied to a batch of images that are not in the training set; (ii) we compute the average of the squared L2 norms of these outputs. The images used for this estimation do not need expert annotation. The second distribution is proportional to ||f(x)||^2. The main difference with the first method is that in step (i), the network is applied to samples drawn from this distribution, avoiding the need to acquire new images. The regularization samples are drawn using slice sampling.
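A minimal numpy sketch of the first estimator, with a toy stand-in for the network f (in the actual method, the real DNN is used and the estimate is added to the training objective during backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy weights standing in for a trained network.
W1 = rng.standard_normal((16, 32)) * 0.1
W2 = rng.standard_normal((32, 10)) * 0.1

def network(x):
    """Toy stand-in for the trained network f."""
    return np.tanh(x @ W1) @ W2

# Step (i): a batch of unlabeled samples from the data distribution
# (in practice, images that need no expert annotation).
batch = rng.standard_normal((64, 16))

# Step (ii): unbiased Monte Carlo estimate of the weighted squared L2
# function norm E_x ||f(x)||^2, which is added to the training loss.
outputs = network(batch)
fn_norm_sq = float(np.mean(np.sum(outputs ** 2, axis=1)))
print(fn_norm_sq)
```

The second estimator described above has the same step (ii) but replaces the unlabeled batch with samples drawn (via slice sampling) from a distribution proportional to ||f(x)||^2.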
For both of the methods, we provide an algorithm integrating them in backpropagation. Moreover, the convergence of the stochastic gradient descent applied to these new objectives is proved. We finally show that the proposed methods applied to convolutional networks outperform the state-of-the-art methods in the low sample regime on benchmark datasets (MNIST and CIFAR10). The obtained results demonstrate clear improvement. The methods have also been applied to classify Optical Coherence Tomography images of breast tissue to detect the presence of cancer in an intraoperative environment. The obtained results show very good qualitative performance, and are currently being evaluated for possible clinical application in a major university hospital.
The proposed methods are very promising in the context of small sample regimes with data lying in a low dimensional manifold.
An extended version of this work is available as an arXiv preprint at https://arxiv.org/pdf/1605.09085.pdf, and an open source implementation can be found at https://github.com/AmalRT/DNN_Reg.
Kurt Bollacker, Natalia Díaz Rodríguez and Xian Li. Beyond Clothing Ontologies: Modeling Fashion with Subjective Influence Networks
Abstract: Extracting knowledge and actionable insights from fashion data still presents challenges due to the intrinsic subjectivity needed to effectively model the domain. Fashion ontologies help address this, but most existing such ontologies are “clothing” ontologies, which consider only the physical attributes of garments or people. In this context, subjective judgments are usually modeled only as opaque categorizations of entities.
If our personal stylists were to search for the perfect outfit for a customer asking for advice, say for his first Burning Man attendance or for a 50th wedding anniversary party, it would be ideal if we could search our database for styles such as rockabilly, retro, boho or “best pieces from the 70’s”. However, despite having ample physical attribute annotations from different vendors, as well as algorithmically generated garments, we still do not have a sufficient purchase on the abstract attributes of a consumer’s perception, such as the aforementioned categories.
We address this problem by proposing a supplementary ontological approach in the fashion domain based on subjective influence networks. These networks measure novelty and impact, and represent influence mechanisms that can validate or refute hypotheses containing subjective or aesthetic components. The components of an influence vector instance are the elapsed time between the beginnings of styles, the magnitude of the influence, the mechanism of influence, and the agent of influence.
We enumerate a set of use cases this approach is intended to address, and discuss possible classes of prediction questions, hypothesis tests and machine learning experiments that could be executed to validate or refute the model. For instance, using network mapping, we could ask: can influences in other cultural domains, such as music, be used to predict fashion influences? Another application is quantifying subjective attributes; for instance, subjective influence networks could characterize the differences between retro and classic glasses and their mechanisms of influence. Example hypotheses to confirm or reject could be: are retro glasses those worn two generations back? Are classic glasses those that never go out of fashion? Another use case is predicting fashion cyclicality: can influence networks identify the periodicity of fashion cycles and predict when bell trousers will be fashionable again? Representing fashion evolution on social media (e.g. hashtags on Instagram) as an influence mechanism is one way to measure influence across time and space scales and to evaluate the “viral” nature of rapid style changes.
Through the use of subjective influence networks, we plan to augment the Knowledge Graph of fashion information, a search engine that allows advanced faceted search to infer features even if they are not explicitly tagged in the merchandise. The final aim is understanding and translating into machine-consumable manners the way humans perceive and transmit aesthetics and style in fashion, and more generally, subjectivity, something that computers do not (yet) do better than us.
Tsui-Wei Weng, Daniele Melati, Andrea Melloni and Luca Daniel. Non-Gaussian Correlated Process Variations in Integrated Photonics
Abstract: During the last few decades, photonics has attracted large interest as a viable technology for interconnect networks, due to its ability to achieve high bandwidth and low power dissipation [1]. Depending on the material system, photonic circuits can potentially be integrated with existing CMOS processes in electrical circuits and thus remarkably reduce fabrication costs as well as improve overall system performance [2]. Some technologies also allow for direct integration of light sources [3]. Nevertheless, due to their tiny scale (in the micrometer and nanometer range), photonic devices are very sensitive to fabrication process variations, leading to potentially significant device performance degradation as well as system failures [4]. Therefore, it is highly desirable to predict the statistical behavior of a photonic device before fabrication in order to reduce design and production costs.

Meanwhile, machine learning has been successfully applied in many fields, such as finance, artificial intelligence, and computational biology, but little research effort has been devoted to integrated photonics. In this work, we use machine learning techniques to analyze fabrication process variation data in integrated photonics and then develop an efficient numerical technique to simulate photonic components and devices under process variations. Specifically, Gaussian mixture models are used to model the variations of device parameters, and the EM algorithm [5] is used to learn the intrinsic correlations from the measurement data. Our proposed technique is demonstrated on a ring modulator example and compared with conventional Monte Carlo simulation [6] at the same level of accuracy in the probability density function.
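The Gaussian-mixture/EM step can be sketched on synthetic one-dimensional data (the real work fits measured, correlated device-parameter data; the two fabrication "modes" and all values below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-in for measured device-parameter deviations: a mixture of
# two fabrication "modes".
data = np.concatenate([rng.normal(-1.0, 0.3, 300), rng.normal(1.5, 0.5, 200)])

# EM for a two-component 1-D Gaussian mixture.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-0.5, 0.5]), np.array([1.0, 1.0])
for _ in range(100):
    # E-step: responsibility of each component for each sample.
    dens = pi * np.exp(-0.5 * ((data[:, None] - mu) / sigma) ** 2) \
           / (sigma * np.sqrt(2 * np.pi))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixture weights, means, and standard deviations.
    nk = resp.sum(axis=0)
    pi = nk / len(data)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk)

print(np.round(np.sort(mu), 2))
```

Once fitted, the mixture can be sampled cheaply in place of new measurements, which is what makes it useful as a surrogate for Monte Carlo process-variation analysis.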

* Partial results of this work are published in [7]; the rest is ongoing work.
[1] A. Biberman and K. Bergman, “Optical interconnection networks for high-performance computing systems,” Reports on Progress in Physics 75, 046402 (2012).
[2] J. Sun, E. Timurdogan, A. Yaacobi, E. S. Hosseini, and M. R. Watts, “Large-scale nanophotonic phased array,” Nature 493, 195–199 (2013).
[3] M. Smit et al., “An introduction to InP-based generic integration technology,” Semiconductor Science and Technology 29, 083001 (2014).
[4] A. Krishnamoorthy, X. Zheng, G. Li, J. Yao, T. Pinguet, A. Mekis, H. Thacker, I. Shubin, Y. Luo, K. Raj, and J. Cunningham, “Exploiting CMOS manufacturing to reduce tuning requirements for resonant optical devices,” IEEE Photon. J. 3, 567–579 (2011).
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the em algorithm,” J. R. Stat. Soc. Series B pp. 1–38 (1977).
[6] S. Weinzierl, “Introduction to Monte Carlo methods,” arXiv preprint hep-ph/0006269 (2000).
[7] Tsui-Wei Weng, Zheng Zhang, Zhan Su, Youssef Marzouk, Andrea Melloni and Luca Daniel, "Uncertainty Quantification of Silicon Photonic Devices with Correlated and Non-Gaussian Random Parameters," Optics Express, 23, Issue. 4, pp.4242-4254, (2015).
Victoria Dean, Andrew Delong and Brendan Frey. Deep Learning for Branch Point Selection in RNA Splicing
Abstract: RNA splicing is a crucial component of the gene-to-protein process in every cell of every complex organism. Splicing works by recognizing local RNA patterns within a gene and removing introns (noncoding regions) to form the coding sequence that is translated into a protein. It is estimated that at least 15% of disease-causing mutations act through mis-splicing, usually by altering these local patterns. Branch point recognition is a necessary sub-step for splicing to occur. If there is a mutation in an exon’s only branch site, the branch site could be deemed unusable, and the exon would be skipped. In this work, we present results from our exploration of different model architectures for predicting branch site selection. We show that a deep neural network substantially improves over PWM-based models to achieve state-of-the-art results for identifying disease-causing mutations that disrupt splicing through the branch site mechanism.

Previously, splicing research has not focused on understanding branch site patterns because no large branch site dataset was available. Last year, a new dataset of 59,359 high-confidence branch points was released. We explore various machine learning models on the task of predicting branch points using this dataset.

As a baseline, a position weight matrix (PWM) was created by aligning the 9-nucleotide sequences surrounding each branch point in the dataset. In addition to the sequence, the distance from the branch point to the 3’ splice site is an important factor: for nearly 90% of branch points in the dataset, there was at least one annotated 3'SS within 18-40 nucleotides downstream. An additional baseline was created by combining the distance feature with the PWM score in a neural network.
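The PWM baseline can be sketched as follows, with a handful of invented aligned windows standing in for the 59,359 annotated branch points:

```python
import numpy as np

# Invented aligned 9-nucleotide windows around hypothetical branch points.
seqs = ["TACTAACAT", "CACTAACTT", "TGCTAACAC", "TACTGACAT"]

bases = "ACGT"
counts = np.ones((len(seqs[0]), 4))  # pseudocount of 1 per base
for s in seqs:
    for pos, b in enumerate(s):
        counts[pos, bases.index(b)] += 1

# Position weight matrix: per-position log-odds of each base against a
# uniform background (0.25).
freqs = counts / counts.sum(axis=1, keepdims=True)
pwm = np.log2(freqs / 0.25)

def pwm_score(seq):
    """Sum of per-position log-odds; higher means closer to the motif."""
    return sum(pwm[i, bases.index(b)] for i, b in enumerate(seq))

print(pwm_score("TACTAACAT") > pwm_score("GGGGGGGGG"))  # True
```

The stronger baseline described above feeds this score, together with the branch-point-to-3'SS distance, into a small neural network.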

Our best model is a convolutional neural network (CNN) that takes the sequence and the distance from the 3’ splice site as input features. The first-layer convolution filters learned a motif similar to the PWM, but also captured subtler patterns. This network achieved better results than the best baseline (which combines PWM scores and distance). We call our model BRANCHR. This branch site model can improve existing splicing models by identifying false-positive 3’ splice sites that lack a branch site. We present results from extending MaxEntScan with BRANCHR, which substantially improves sensitivity, especially on “hard” sites where MaxEntScan does not give a confident prediction.

The model can also help to classify variants of unknown significance in the intronic region near the 3’ splice site, and even re-classify variants previously thought to be benign or pathogenic. We show example ‘mutation maps’ for mutations from the literature and ClinVar, illustrating how these variants could affect the branch site.

This work is a demonstration of the power of deep learning applied to fundamental problems in genome biology and disease diagnostics, as genetic data becomes more readily available.
Sabina Tomkins, Anbang Xu, Yufan Guo and Zhe Liu. Discovering Dialog Acts on Social Media: A Generative Framework
Abstract: We introduce a framework for discovering dialog act templates, and use these templates to generate responses for customer service agents on social media. Increasingly, customers are turning to social media for their customer service needs. However, the generation of templates which can provide agents with high quality responses has not kept pace with customers’ changing demands. To meet the evolving needs of customer service systems, we propose a framework for rapidly generating Twitter reply templates. To assess the validity of our method, we compute the BLEU score between each generated sentence and its original counterpart.

Our framework involves three steps. First, we separate all agent and customer documents into two distinct corpora, where customer documents are tweets and agent documents are sentences; the documents in each corpus are then grouped into dialog act clusters. Next, given a customer document, we assign it a cluster id and analyze its emotional tone using IBM’s Tone Analyzer API. Given the customer cluster id, we use a custom probabilistic graphical model to infer the cluster id of each sentence in the agent reply. Finally, the agent cluster ids and the customer tone distribution are used to generate a doc2vec vector representation of a reply.

By evaluating the BLEU score between the original and generated agent replies, we can optimize our framework in a number of ways: we select the optimal clustering method and number of clusters, we modify the probabilistic graphical model, and we determine which sentence generation technique to employ.
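The evaluation metric can be sketched as a simplified sentence-level BLEU (limited to bigrams; practical evaluations typically use a smoothed BLEU-4 implementation such as NLTK's, and the reply texts below are invented):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of modified n-gram
    precisions up to max_n, with the standard brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(count, r[g]) for g, count in c.items())
        total = max(sum(c.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Invented generated/original agent replies.
generated = "sorry for the trouble please dm us your order number"
original = "sorry for the trouble please send us your order number"
print(round(bleu(generated, original), 3))  # 0.837
```

A higher score between a generated reply and the agent's actual reply indicates that the framework's clustering and generation choices better reproduce real agent behavior.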

To evaluate our framework we collected Twitter conversations from 60 brands using Twitter's public API. The brands were chosen to be representative, and thus range from clothing companies to airlines. We pool all of the brand data together in order to find dialog acts that are consistent across customer service conversations and independent of the domain context. The dataset consists of conversations between agents and customers. All retweets are removed, and a conversation must be started by a customer. To ensure that only high quality templates are generated, we consider only those agent tweets which have a favorite count of at least one.

Preliminary results show that we are able to detect meaningful dialog acts specific to customer service. Example dialog acts include: apologizing / emphasizing, directing conversations offline, and technical troubleshooting. While this framework has been successful for customer service, it requires no expert knowledge and can be extended to other applications as well.
Navya Nizamkari. A Graph-based Trust-enhanced Recommender System for Service Selection in IOT
Abstract: In an Internet of Things (IOT) network, each node (device) provides and requires services. With the growth of the IOT, the number of nodes providing the same service has also increased, creating the problem of selecting one reliable service from among many providers. In this paper, we propose a scalable graph-based collaborative filtering recommendation algorithm, enhanced with trust, to solve the service selection problem; unlike a central recommender, it can scale to match the growth of the IOT. Our IOT network follows a Social IOT (SIOT) paradigm, where nodes can interact and form relationships with others, similar to a social network. Relationships such as co-location, co-ownership, co-parent and friendship can be formed autonomously by nodes.

After interacting with another node, each node rates it based on the service it provides. A recommender system can solve the service selection problem by predicting ratings and suggesting the node with the highest predicted rating. Although a central traditional collaborative filtering (CF) recommendation engine could be used for prediction, it would have two major performance issues. First, it would not scale well, as it keeps the ratings matrix in memory for prediction: there are presently 9 billion devices in the IOT, a number estimated to reach 24 billion by 2020. Second, it suffers from data sparsity problems such as the cold start problem for nodes that have not provided any ratings yet.

To tackle the data sparsity problem that occurs when calculating similarity for prediction, we calculate trust in addition to Pearson similarity. Just as we trust the recommendations and ratings of our friends and relatives, we incorporate a social paradigm into the IOT so that nodes can form relations and use them to calculate trust. We calculate two kinds of trust: one from ratings and the other from the network structure. For similarity, we use Pearson similarity on ratings and the Jaccard coefficient on the network. We tackle the scaling issue by taking a decentralized approach: each node does the rating prediction by itself. When a node X wants to predict its rating for a node Y, it sends a request to all its friends asking for their rating of Y, and its friends in turn forward the request to their friends. X then calculates its similarity with the nodes that have rated Y in the past. If X and such a node are directly connected, we calculate direct trust; if not, we calculate indirect trust using Golbeck's TidalTrust algorithm.
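A toy sketch of the trust-blended prediction step (the blending weight alpha, the node names, ratings, and trust values are all invented; in the described system, trust comes from the SIOT relations and TidalTrust):

```python
def pearson(ratings_a, ratings_b):
    """Pearson correlation over the nodes both parties have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    xs = [ratings_a[i] for i in common]
    ys = [ratings_b[i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

def predict(x_ratings, neighbors, target, alpha=0.5):
    """Predicted rating of `target`: a weighted average over neighbors that
    have rated it, where the weight blends similarity and trust."""
    num = den = 0.0
    for n_ratings, trust in neighbors:
        if target not in n_ratings:
            continue
        w = alpha * pearson(x_ratings, n_ratings) + (1 - alpha) * trust
        num += w * n_ratings[target]
        den += abs(w)
    return num / den if den else None

# Invented ratings for node X and two of its friends, each with a trust value.
x = {"a": 5, "b": 3, "c": 4}
friends = [({"a": 4, "b": 2, "c": 5, "y": 4}, 0.9),
           ({"a": 2, "b": 5, "c": 1, "y": 1}, 0.2)]
print(predict(x, friends, "y"))
```

The highly trusted, similar friend dominates the prediction, which is the intended behavior when ratings alone are too sparse.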

We evaluated our proposed method using the Libimseti dataset, as we needed user-user rather than user-item ratings. We then added additional attributes such as location, owner and manufacturer to our user nodes; these attributes are used by nodes to form SIOT relationships. Our method outperforms the traditional CF algorithm in RMSE, MAE and coverage. Thus, a graph-based, trust-enhanced recommender system can meet the needs of growing IOT structures and predict ratings with lower error and good coverage.
Ana Marasovic and Anette Frank. Fine-grained opinion mining with an end-to-end neural model
Abstract: Fine-grained opinion mining aims to detect opinion expressions (O) in a text, measure their intensity and sentiment, and identify their targets (T) (entities or propositions at which the sentiment is directed) and holders (H) (entities that express the opinion) (Wiebe et al., 2005). In the following sample sentence, the opinion expression infuriated (O) indicates that there is a (negative) opinion from Beijing (H) regarding the sale (T). The relevant opinion entities (opinion expressions, targets and holders) are marked with O, T and H, respectively. Numerical subscripts indicate the relations of holders and targets to the opinion expressions, i.e. is-from (H) and is-about (T), respectively.

[The sale]T_1 [infuriated]O_1 [Beijing]H_1,2 which [regards]O_2 [Taiwan]T_2 an integral part of its territory awaiting reunification, by force if necessary. (Katiyar and Cardie, 2016)

The commonly accepted MPQA corpus (Deng and Wiebe, 2015) uses span-based annotation to represent opinion entities. Hence, prior work mostly tackled extraction of opinion entities and relations among them using traditional sequence labelling techniques (Choi et al., 2006; Breck et al., 2007; Yang and Cardie, 2013).
More recent work investigates deep learning approaches to alleviate dependence on manually crafted feature sets, opinion lexicons, NER and other preprocessing components. In a first step, Irsoy and Cardie (2014) applied a bidirectional deep recurrent neural network (RNN) to opinion expression extraction and outperformed state-of-the-art baselines. Katiyar and Cardie (2016) extended this work to the joint extraction of all opinion entities and the relations among them. However, this model does not achieve state-of-the-art performance: it faces difficulties with overlapping and embedded entities and relations, and does not handle cross-sentential relations.
In our work we aim to construct a model which overcomes these drawbacks.
Some of the weaknesses in Katiyar and Cardie (2016)'s approach can be attributed to the restricted training-set size of MPQA. As the task is similar in nature to semantic role labelling (SRL), we can exploit abundant data from SRL or even raw dependency parsing, and enhance the deep bi-RNN model within a multi-task learning framework as proposed by Søgaard and Goldberg (2016). Another recent model we are investigating is a modification of the LSTM that aims to better incorporate discourse context from preceding sentences (Wang and Cho, 2016). The basic idea is to keep dependencies within the sentence being modelled separate from those between the current and previous sentences. Wang and Cho (2016) claim their novel way of feeding the LSTM unit can benefit any encoder-decoder-based application.
We will contrast our novel approach with the state-of-the-art baselines on the benchmark dataset MPQA. The same model will also be employed in our contribution to the Source-and-Target Belief and Sentiment Evaluation (BeSt) at TAC 2016, a closely related task on sentiment and belief attribution. By the time of the workshop we will present first evaluations of our new model, on MPQA and BeSt.
Carrie Segal. Minimal Connectivity Graph for Hardware Artificial Neural Networks
Abstract: Visual recognition and language comprehension rely on underlying data structures which interact with a CPU or GPU. Once sensor signals are translated into binary representations, the input data is fed through various software architectures that represent data flow paths [1],[2]. The presented work creates generator matrices for the implementation of Gallager codes. The generator matrix is derived from the weights of trained neural network models capable of unsupervised learning. The weights are extracted and converted into the parity check matrices used to encode and decode Gallager (low-density parity-check) codes. The parity check matrix represents the possible code words available for a given set of constraints. The information capacity of the resulting code ensembles is analyzed and compared against existing state-of-the-art parity check codes. The reason for making this comparison is to find out whether using a successful neural net as the basis for a Gallager code results in better than average information capacity.

The purpose of viewing the graph of trained weights is to understand the minimum connectivity requirements for a hardware implementation of these software architectures. This work assumes the existence of a simple asynchronous local computation unit, similar to the processor tile introduced by LeCun. Existing software architectures begin training with multiple layers of neurons, typically fully connected, with weights from each neuron in one layer to every neuron in the adjacent layer. The connections reduce over time as patterns are learned.

It is difficult for an integrated circuit designer to provide extremely flexible architectures that a software programmer can configure at will. By identifying the plausible configuration scenarios, a hardware implementation does not have to begin training with full connectivity. The weight-reduced graph of the trained model makes apparent what connectivity is minimally required for a software architecture to be implemented as a faster, lower-power hardware ASIC.
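The extraction of a parity-check matrix from trained weights can be sketched as simple magnitude pruning (the random weights and the threshold below are invented stand-ins for a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-in for a trained weight matrix between two layers.
weights = rng.standard_normal((8, 16))

# Prune weak connections: only strong weights survive as graph edges, giving
# a sparse binary matrix usable as a Gallager (LDPC) parity-check matrix H.
threshold = 1.0
H = (np.abs(weights) > threshold).astype(int)

density = H.mean()
print(f"{H.sum()} edges kept, density {density:.2f}")
# Row/column degrees indicate the minimal wiring a hardware ASIC must provide.
print("check-node degrees:", H.sum(axis=1))
```

The row and column degree distributions of H are exactly the connectivity profile that bounds the wiring a hardware implementation needs.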

[1] Farabet C. Lecun Y. Culurciello E., NeuFlow: A Runtime Reconfigurable Dataflow Architecture for Vision, CVPRW 2011
[2] TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org
Cristina Segalin, Dong Seon Cheng and Marco Cristani. Social Profiling through Image Understanding: Personality Inference using Convolutional Neural Networks
Abstract: In this work, we examine the problem of relating a set of image preferences to personality traits by using a deep learning framework. We cast this recently introduced application problem as a new level of image understanding that enhances the role of images through considerations on the social aspects of contemporary online activities. The role of social platforms like Flickr, Facebook, Instagram, etc., in building online social personas where most activities are shared to a wide audience, creates a unique opportunity to study image-based activities (like authoring, uploading and preferring images) as social messages, embedded with characteristics resembling verbal and non-verbal signals in face-to-face spoken communications.
Moreover, the presence of an author-audience paradigm imbues the messages with an extra layer of significance: there is the communication intended by the author, and there is the communication assumed by the audience. Therefore, in the same way that an email, a blog or a chat reveals something of its author, so now the images authored by a person may embed some of her individual traits, such as her personal aesthetic preferences or her personality traits. Thus, our problem becomes inferring both self-assessed traits (how the personality traits of a person can be guessed from her preferred images) and attributed traits (what impressions, in terms of personality traits, these images trigger in unacquainted people), learning a sort of wisdom of the crowd.

Differently from many studies that describe pictures through an array of feature extractors and scene analyzers, our characterization of each image is locked within the layers of a CNN. This allows us to discover more entangled attributes, unattainable by simple hand-crafted features because of the interaction between many different factors, and to better generalize the patterns that identify a trait. Experiments have been focused on the PsychoFlickr corpus: the dataset provides 60,000 images, i.e. 200 “favored” images from each of 300 Flickr users. The personality profile of each user is described in terms of the Big Five traits. Additionally, we collected this information from 12 unacquainted assessors who rated the image sets of each user.

The experimental results show that the proposed method is promising, outperforms state-of-the-art results and captures what characterizes a certain trait: using a deconvolution strategy we found a clear distinction of features, patterns and content between low and high values in a given trait. We also introduce an online application demo that uses our trained classifiers to predict personality traits given a proposed set of uploaded or selected pictures liked by a subject.

A possible application of this research might be providing social networks with tools to soft-profile users, suggesting compatible users to connect with, or used as a marketing evaluation tool to help predict the impact of a set of images on a hypothetical audience of customers. Our online demo is a step along this direction, evidence that this new kind of image understanding is not only a mere academic research endeavor, but a potential groundbreaking market application.
James Bedell, Ayush Jaiswal and Elizabeth Staruk. Generating Song Lyrics with Style
Abstract: Natural language generation is fundamental to dialogue systems, virtual humans, creative storytelling, and machine translation. Creative domains can add a human element to these systems, leading to enhanced generated language with more personality. It is questionable to suggest that artificially intelligent language contains self-expression. However, it can mimic the appearance of self-expression by exploiting the vast corpora of creative prose available. We consider music lyrics—a domain which exhibits both linguistic constraints and artistic freedom—and generate them automatically using machine learning techniques.

We take a top-down generative approach to automatically create lyrics that resemble an artist’s style in both content and structure. In keeping with the spirit of self-expression, we develop these models independently for each artist, and evaluate our method based on whether or not the source artist of the generated lyrics can be identified. We automate the training data annotation process with heuristics that detect recognizable patterns such as rhyming or repeating stanzas (as in a chorus). These annotations guide our models, which are trained at different levels of granularity: the overall song structure, which is determined by a trigram language model; the stanza structure, which is sampled from probability distributions trained on each artist; and the line structure, which uses parallel trigram models with linear interpolation of bigram and unigram models to preserve lyrical fluency. The generated outputs are then ranked using a style function trained separately on each individual artist. The result is a composite model that generates expressive, creative pieces in the style of a particular artist.
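The line-level model can be sketched as a trigram language model with linear interpolation of bigram and unigram estimates. The toy corpus, interpolation weights and function names below are illustrative stand-ins, not the authors' actual values:

```python
from collections import Counter

def train_ngrams(lines):
    """Count unigrams, bigrams and trigrams over a list of lyric lines."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for line in lines:
        toks = ["<s>", "<s>"] + line.split() + ["</s>"]
        for i in range(2, len(toks)):
            uni[toks[i]] += 1
            bi[(toks[i - 1], toks[i])] += 1
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
    # context totals for conditional probabilities
    bictx, trictx = Counter(), Counter()
    for (v, w), c in bi.items():
        bictx[v] += c
    for (u, v, w), c in tri.items():
        trictx[(u, v)] += c
    return uni, bi, tri, bictx, trictx

def interp_prob(w, u, v, model, lam=(0.1, 0.3, 0.6)):
    """P(w | u, v) as a weighted mix of unigram, bigram, trigram estimates."""
    uni, bi, tri, bictx, trictx = model
    p1 = uni[w] / sum(uni.values())
    p2 = bi[(v, w)] / bictx[v] if bictx[v] else 0.0
    p3 = tri[(u, v, w)] / trictx[(u, v)] if trictx[(u, v)] else 0.0
    return lam[0] * p1 + lam[1] * p2 + lam[2] * p3
```

The interpolation weights sum to one, so the mixture remains a proper distribution over the vocabulary for any seen context.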

In our pilot study, participants (N = 32) were given generated lyrics and obscure, real lyrics from artists they claimed to be familiar with. They were then asked to identify the source artist for both sets of lyrics, and to rate the perceived fluency of the generated lyrics. We found that participants identified the correct artist for the generated lyrics as often as for the obscure but real lyrics (21/32 vs. 20/32 times). On average, participants were somewhat satisfied (mean = 2.97 on a 1-5 Likert scale) with the fluency of the generated lyrics. Our findings suggest that the composite model successfully creates lyrics with some of the individual flavor of a source artist.
Deeptha Girish, Vineeta Singh and Anca Ralescu. Determining the cluster number in the K-means algorithm
Abstract: Image segmentation is one of the most important processes in image analysis whose goal is to divide the image into non-overlapping clusters such that all pixels in one cluster share some common property. For example, in medical images a particular color represents a particular stain. These stains are actually protein binders that bind with a certain kind of molecule. Therefore, each color in such an image represents a certain biological structure. Using a color based segmentation algorithm on such an image identifies different biological structures present in the image, which can then be used for extracting features such as shape and size.

The k-means algorithm is well suited to color-based image segmentation and is widely used for it. Its main disadvantage is that the number of clusters has to be defined in advance, which makes it difficult to batch-process thousands of images: each image in the dataset may differ, and it is impractical to set the number of clusters by hand before running the algorithm on every image. The number of clusters depends on the data itself and on the size of the dataset. If the number of clusters is too small, valuable information may be lost and the goal of identifying and segmenting particular objects is not achieved. Increasing the number of clusters to the number of data points yields zero error, but the data points are then not actually 'clustered', and the purpose of clustering is defeated. An optimal number of clusters must strike a balance between increasing accuracy and keeping the number of clusters as low as possible. Our paper deals with the problem of determining this optimal number.

Current methods for determining the number of clusters in an image run the k-means algorithm several times on a single image and compute performance measures for each candidate number of clusters. The number that gives the best value of the performance measure is chosen as optimal; alternatively, the number beyond which adding clusters no longer changes performance drastically is chosen (the elbow method). Either way, this is a time-consuming and inefficient approach.
To tackle this problem and eliminate human interference, we propose an algorithm that starts with two clusters. Each data point is assigned to the cluster it is closest to; if a data point is not ‘close enough’ to any of the current clusters, a new cluster is added. Numerous experiments are performed to justify our quantification of the idea of ‘close enough’. The algorithm is also tested on other datasets with a known number of clusters to evaluate its correctness.
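A minimal sketch of such a threshold-based scheme (illustrative, not the authors' implementation: it seeds from the first point rather than two clusters, and the threshold tau plays the role of "close enough"):

```python
import numpy as np

def threshold_clustering(points, tau):
    """Assign each point to its nearest center; open a new cluster when
    the nearest center is farther than tau."""
    centers = [points[0].astype(float)]
    counts = [1]
    labels = [0]
    for x in points[1:]:
        dists = [np.linalg.norm(x - c) for c in centers]
        j = int(np.argmin(dists))
        if dists[j] <= tau:
            counts[j] += 1
            centers[j] += (x - centers[j]) / counts[j]  # running-mean update
            labels.append(j)
        else:
            centers.append(x.astype(float))  # not close enough: new cluster
            counts.append(1)
            labels.append(len(centers) - 1)
    return labels, centers
```

On two well-separated blobs, a moderate tau recovers exactly two clusters without the cluster count being specified in advance.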
Ruoxi Sun. Variational inference for super resolution microscopy
Abstract: Super resolution (SR) microscopy methods, such as STORM and PALM, earned the 2014 Nobel Prize in Chemistry for bringing "optical microscopy into the nanodimension". These SR methods overcome the light diffraction barrier of traditional optical microscopy and enable researchers to ask questions previously considered inaccessible. Given a sample treated with a fluorescent dye, the basic strategy is to stochastically (and independently) activate fluorophores at a rate that guarantees that only a sparse subset are activated at any given time. One then repeatedly images the sample to obtain a movie wherein each frame reflects a random, sparse set of fluorophore activations. One exploits the sparsity of activations within each frame to localize the positions of the expressed fluorophores; aggregating a long sequence of such point localizations then yields a super-resolved image. The improvement in resolution thus comes at the cost of a challenging computational analysis.

Many approaches have been proposed to improve localization performance within each individual frame. For example, several modern techniques take a “compressed sensing” approach that exploits the true sparsity of the underlying fluorophore activations; these techniques lead to a formulation as a regularized sparse deconvolution problem, providing scalable and fairly accurate reconstructions. However, these approaches are sub-optimal because each frame is reconstructed independently, thereby discarding information that should be shared across frames. Intuitively, given the first N − 1 reconstructed frames, we have a good deal of prior information about the locations of fluorophores in the N-th frame, and ignoring this information will in general lead to highly suboptimal estimates.

Here we propose a scalable Bayesian approach that properly pools information across frames and can also incorporate prior information about the image, extending the reach of super resolution microscopy even further. Our method couples variational inference techniques with a data summarization based on Laplace approximation to ensure computational scalability. Our formulation makes it straightforward to incorporate prior information about the underlying sample to further improve accuracy. The proposed method obtains dramatic resolution improvements over previous methods while retaining computational tractability.

Details: https://drive.google.com/a/columbia.edu/file/d/0B81bH90WR2mMcEltM2JoRU1iNkk/view?usp=sharing
Diana Borsa, Thore Graepel and John Shawe-Taylor. Learning Option-like Features through Multi-task Value Functions
Abstract: We investigate a paradigm in multi-task reinforcement learning (MT-RL) in which an agent is placed in an environment and needs to learn to perform a series of tasks. Since the environment does not change, there is potentially a lot of common ground among the tasks, and learning to solve them individually may be wasteful. In this paper, we explicitly model and learn this shared structure as it arises in the state-action value space. We show how to jointly learn optimal value functions by modifying the popular value-iteration and policy-iteration algorithms to accommodate this shared representation and leverage the power of multi-task supervised learning.

More concretely, we propose learning a shared embedding of the state-action space that spans the space of (near-)optimal value functions. We refine this representation of the input space at each step of value/policy iteration by moving in the direction that best optimizes the joint TD error across tasks. Gradually, as the value functions converge, we obtain a very low-dimensional representation of the state-action space that captures shared optimal behaviour.

We demonstrate that the proposed model and training procedures infer good value functions, even in low-sample regimes, and consistently outperform their single-task counterparts. We test them on a series of navigation tasks (such as the four-room problem) and on learning to exercise financial derivatives (an optimal stopping problem). In both scenarios we obtain a considerable improvement in the quality of the inferred policy as well as in the value function approximation.

In addition to data efficiency, our analysis shows that learning abstractions jointly across tasks leads to more robust, transferable representations with the potential for better generalization.
Furthermore, by examining the representations learnt for the navigation tasks, we see that they encode option-like sub-plans for navigating the environment -- they span the value functions corresponding to navigating from N to S, from E to W, or to any particular room in the environment.

Lea Frermann. A Bayesian Model of Joint Category and Feature Learning
Abstract: From day one, infants are exposed to a complex world, and they need to acquire an extraordinary amount of knowledge in order to be able to understand their environment and react meaningfully to it. How do they acquire and represent this knowledge? Structured mental representations, in terms of categories (e.g., animal, furniture) of concepts (e.g., dog, chair) have been shown to underlie fundamental cognitive abilities such as learning and using language, and influence the way humans perceive and react to their environment.

We develop the first computational model that investigates the process by which children acquire categories and their associated features. Computational models of cognitive phenomena allow us to systematically investigate the influence of the input and of processing constraints, and to draw conclusions about human cognitive processing in general, and category learning in particular.

Our model captures three important characteristics of child category acquisition. First, categories and their features are acquired jointly, and the two aspects mutually influence each other. Second, features of categories are structured into feature types, which are shared across categories (e.g., animals have characteristic behaviors; tools have characteristic functions; and both categories have characteristic appearance). Finally, learning proceeds incrementally: children immediately integrate, and utilize, novel information from the input they receive from their environment.

We formalize the above characteristics in a Bayesian model which acquires (a) categories, (b) feature types, and (c) category-feature type associations from linguistic input. We approximate the learning environment of the child with child-directed language. While this ignores other modalities (e.g., visual or pragmatic), learning from text corpora allows us to train and test our model on a large scale. We learn categories (e.g., animal) of concepts (e.g., dog, cat) from linguistic mentions of concepts in their local context, which serves as an approximation of the concepts' features. Concepts with similar features are assigned the same category. Our model represents both categories and feature types as clusters of words.

Given a corpus of concept mentions in context, for each input we (1) draw the category of the observed concept from a global distribution over categories, $k \sim p(\theta)$; (2) draw a feature type from the category-specific distribution over feature types, $g \sim p(\phi|k)$; and (3) draw a set of features from the feature type-specific distribution over features, $f_i \sim p(\psi|g)$. Importantly, we estimate the parameters of our model incrementally with particle filters, an incremental Monte Carlo method. Our model sequentially observes input data (ordered with respect to the age of the addressed child), and updates its parameter estimates on the fly with the novel information. Concretely, it maintains a set of parameter samples (`particles') which are individually updated (propagated through time).
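The three-step generative process can be written down directly as ancestral sampling. The dimensions and symmetric Dirichlet priors below are illustrative placeholders, not the model's actual hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
K, G, V = 3, 4, 10  # categories, feature types, vocabulary size (toy values)

theta = rng.dirichlet(np.ones(K))        # global distribution over categories
phi = rng.dirichlet(np.ones(G), size=K)  # per-category feature-type distributions
psi = rng.dirichlet(np.ones(V), size=G)  # per-feature-type word distributions

def generate_mention(n_features=5):
    """Sample one concept mention: category, feature type, observed features."""
    k = rng.choice(K, p=theta)                    # (1) draw category
    g = rng.choice(G, p=phi[k])                   # (2) draw feature type
    f = rng.choice(V, p=psi[g], size=n_features)  # (3) draw context features
    return k, g, f
```

Inference then runs this process in reverse, with particle filters maintaining and incrementally updating samples of theta, phi and psi as mentions arrive.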

Evaluation shows that our model captures important characteristics of the learning process. Intuitively meaningful categories, feature types and their associations emerge. The quality of the learnt clusters improves steadily with incoming data. We also show that our model learns successfully under processing constraints which approach the cognitive constraints that humans are subjected to.
Koushiki Bose, Wen-Xin Zhou, Jianqing Fan and Han Liu. Finite Sample Properties of Robust M-Estimation with Applications to Covariate-Adjusted Large-Scale Multiple Testing
Abstract: High-dimensional data are often automatically collected and of low quality. Even moderately heavy-tailed distributions can generate very large outliers in high dimensions, so robust procedures are all the more important for high-dimensional problems. In this work, we develop a new finite sample theory for robust M-estimation. Such a finite sample theory is motivated by contemporary problems of simultaneously testing many hypotheses, where the goal is to control either the false discovery rate (FDR) or the familywise error rate (FWER). We apply our finite sample theory to analyze a covariate-adjusted multiple testing procedure under moderately heavy-tailed errors, using an approximate factor model.

When the errors in a regression model are heavy-tailed, more robust methods such as the Huber estimator [1] are needed to replace the widely used ordinary least squares (OLS) estimator. However, unlike for the OLS estimator, all existing theoretical results for the Huber estimator are asymptotic, including asymptotic normality and the Bahadur representation [2]. The first contribution of this work is a nonasymptotic Bahadur representation of the Huber estimator, along with an exponential-type concentration inequality, under mild moment conditions on the error variables. The result further yields a number of important normal approximation results, including a Berry-Esseen bound and Cramér-type moderate deviations. These probabilistic results are particularly useful for investigating the robustness and accuracy of the P-values in multiple testing.
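For concreteness, here is a sketch of Huber M-estimation of a location parameter via iteratively reweighted least squares. The tuning constant and the iteration scheme are standard textbook choices for illustration, not part of this paper's finite-sample theory:

```python
import numpy as np

def huber_location(x, delta=1.345, n_iter=50):
    """Huber M-estimate of location: observations with residuals beyond
    delta are downweighted proportionally to 1/|residual|."""
    mu = np.median(x)  # robust starting point
    for _ in range(n_iter):
        r = np.abs(x - mu)
        w = np.minimum(1.0, delta / np.maximum(r, 1e-12))  # Huber weights
        mu = np.sum(w * x) / np.sum(w)  # weighted-mean update
    return mu
```

On data with a gross outlier, the estimate stays near the bulk of the sample, whereas the ordinary mean is dragged toward the outlier.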

As a second contribution, we address two important challenges of large-scale multiple testing: (i) dependence between test statistics and (ii) heavy-tailedness of the data. To characterize the dependence, we employ a multi-factor structural model [3]. To achieve robustness, we propose a Huber-loss-based approach to constructing test statistics for the individual hypotheses. We prove that, under mild conditions, our procedure asymptotically controls the overall false discovery proportion at the nominal level.

Thorough numerical results on both simulated and real-world datasets are provided to back up our theory. We use stock market data from the S&P 500 index to identify stocks with excess returns unexplained by the efficient market hypothesis. We also apply the proposed procedure to the analysis of a neuroblastoma dataset to identify differentially expressed genes between groups of patients. Our newly proposed robust, dependence-adjusted method performs well numerically in terms of both size and power. It significantly outperforms the conventionally used multiple t-tests under strong dependence, and is applicable even when the true error distribution deviates wildly from the normal distribution.

[1] Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73–101.
[2] He, X. and Shao, Q.-M. (2000). On parameters of increasing dimensions. J. Multivariate Anal. 73 120–135.
[3] Fan, J., Han, X. and Gu, W. (2012). Estimating false discovery proportion under arbitrary covariance dependence. J. Amer. Statist. Assoc. 107 1019–1035.
Ramya Korlakai Vinayak and Babak Hassibi. Crowdsourced Clustering: Querying Edges vs Triangles
Abstract: Collecting data from non-expert workers on crowdsourcing platforms such as Amazon Mechanical Turk, Zooniverse, Planet Hunters, etc. for various applications has recently become quite popular. Applications range from creating labeled datasets for training and testing supervised machine learning algorithms to making scientific discoveries. Since the workers on crowdsourcing platforms are often non-experts, the answers obtained will invariably be noisy. The problem of designing queries and inferring quality data from such non-expert crowd workers is therefore of great importance.

We consider the task of clustering items using answers from non-expert crowd workers. In such cases, the workers are often not able to label the items directly; however, it is reasonable to assume that they can compare items and judge whether they are similar or not. As an example, consider the task of collecting labels of images, e.g., of birds or dogs of different species and breeds. To label the image of a bird or dog, a worker must either have some expertise regarding bird species and dog breeds, or be trained on how to label each of them. Since hiring experts or training non-experts is expensive, we focus on collecting labels of images through image comparison followed by clustering. Instead of asking a worker to label an image of a bird, we can show her two images of birds and ask: ``Do these two birds belong to the same species?" Answering this comparison question is much easier than the labeling task and requires no expertise or training. Though different workers might use different criteria for comparison, e.g., color of feathers, shape, size, etc., the hope is that, averaged over the crowd workers, we will be able to reasonably resolve the clusters.

An important question is what queries to make, and we compare two types: random edge queries, where a pair of items is revealed, and random triangle queries, where a triple is. Since it is far too expensive to query all possible edges and/or triangles, we must work with partial observations subject to a fixed query budget. When a generative model for the data is available (and we consider a few of these), we determine the cost of a query by its entropy; when no such model exists, we use the workers' average response time per query as a surrogate for cost. In addition to theoretical justification, through several simulations and experiments on two real datasets on Amazon Mechanical Turk, we empirically demonstrate that, for a fixed budget, triangle queries uniformly outperform edge queries. Even though, in contrast to edge queries, triangle queries reveal dependent edges, they provide more reliable edges and, for a fixed budget, many more of them. We also provide a sufficient condition on the number of observations, the edge densities inside and outside the clusters, and the minimum cluster size required for exact recovery of the true adjacency matrix from triangle queries using a convex optimization-based clustering algorithm.
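To make the entropy-based cost comparison concrete, consider a toy generative model (illustrative, not the paper's): each item independently lands in one of K equally likely clusters. An edge query then has two outcomes, while a triangle query has five consistent outcomes (all same, one of three distinguishable same-pairs, all different), and its entropy is spread over three revealed edges:

```python
import math

def entropy(ps):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def edge_bits_per_edge(K):
    q = 1.0 / K  # probability a random pair is same-cluster
    return entropy([q, 1 - q])

def triangle_bits_per_edge(K):
    q = 1.0 / K
    # outcomes: all same; one of 3 distinguishable same-pairs; all different
    outcomes = [q * q] + [q * (1 - q)] * 3 + [(1 - q) * (1 - 2 * q)]
    return entropy(outcomes) / 3.0  # a triangle reveals three edges
```

Under this toy model the per-edge information cost of a triangle query is below that of an edge query for every K, consistent with the budget argument above.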

This work will appear in NIPS 2016.
Marta Kryven and William Cowan. Semi-Automated Classification of Free-form Participant Comments
Abstract: Behavioral experiments with human participants often request free-form feedback, asking open-ended questions, such as, `How did you make your decisions during the experiment?' or `Why did you choose alternative A over alternative B?'. Free-form questions are valuable for understanding the participants' thought process and interpretation of the task. Patterns in the answers, however, are rarely obvious. The usual approach to analyzing free-form comments is manual coding. The researcher comes up with a set of categories of comments by looking through the data-set. Then, two or more independent raters code each comment by assigning a category label from this set of categories. Inter-rater agreement, the percent of entries on which the two raters independently agree on the same label, measures the quality of the proposed categories.

Coming up with good categories for free-form natural language comments is, however, hard. Participants express themselves using different styles of language, synonyms or metaphors, whereas defining a category set requires extracting a small set of word features associated with each label. We describe and implement an algorithm that partially automates this process. The algorithm combines natural language processing, synonym analysis and decision trees to produce a tree of word features exposing the semantic structure of the content to aid category analysis.

The algorithm first processes the text to generate a set of word features and then trains a decision tree. To extract the word features, the individual comments are split into tokens and tagged with part-of-speech (POS) tags from the Penn Treebank tag set. Verbs, nouns, adjectives and adverbs are retained. Next, words are converted to a neutral form using a stemming algorithm, such as Porter's algorithm, or natural language lemmatization; an evaluation of both in the context of comment classification suggests that lemmatization is somewhat more effective. Next, we convert each word to its most common synonym, given the word's POS, using Princeton WordNet®. The algorithm retains the set of synonyms of each word (its synset) and merges word feature vectors if their synsets overlap. It also incorporates user-defined, context-dependent synonyms. This pre-processing reduces the set of potential unique word features to 30% of the original number of unique words occurring in the text.
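The pipeline can be illustrated end to end with toy stand-ins for its real components (Penn Treebank POS tagging, Porter stemming or lemmatization, and WordNet synsets). The word lists below are invented purely for illustration:

```python
# Toy POS dictionary standing in for a Penn Treebank tagger.
TOY_POS = {"chose": "VB", "picked": "VB", "quickly": "RB", "best": "JJ",
           "the": "DT", "option": "NN", "alternative": "NN"}
CONTENT_TAGS = {"VB", "NN", "JJ", "RB"}  # verbs, nouns, adjectives, adverbs

# Toy synonym sets standing in for WordNet synsets.
TOY_SYNSETS = {"chose": {"choose", "pick"}, "picked": {"choose", "pick"},
               "option": {"option", "alternative"},
               "alternative": {"option", "alternative"}}

def extract_features(comment):
    """Tokenize, keep content words, and canonicalize each word to a
    fixed representative of its synonym set so overlapping synsets merge."""
    toks = comment.lower().split()
    kept = [t for t in toks if TOY_POS.get(t) in CONTENT_TAGS]
    return {min(TOY_SYNSETS.get(t, {t})) for t in kept}
```

With this merging, comments phrased with different synonyms ("chose the option" vs. "picked the alternative") map onto shared word features, which is what makes the downstream decision tree's splits meaningful.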

Next, the system trains a decision tree using the set of recovered word features, and calculates the chi-square criterion and the list of examples at each node. The tree is presented in an interactive graphical user interface that lets researchers examine the structure of the data in a concise form and come up with general category labels more easily. Although the system was initially designed for analyzing comments in behavioral experiments, it can equally be applied to other snippets of text, such as online reviews, queries, or social media content.
Junli Gu, Yibing Liu, Yuan Gao and Maohua Zhu. OpenCL caffe: Accelerating and enabling a cross platform machine learning framework
Abstract: Deep neural networks (DNNs) achieved a significant breakthrough in visual recognition in 2012 [1] and quickly became the leading machine learning approach for large-scale object recognition in Big Data applications. The successful deployment of DNN-based applications calls for a cross-platform software framework that supports multiple user scenarios, from offline model training on HPC clusters to online recognition in embedded environments. Existing DNN frameworks are mostly built on closed-format CUDA implementations, which limits the range of hardware on which DNNs can be deployed.

This paper presents OpenCL caffe, which transforms the popular CUDA-based framework caffe [2] to an open-standard OpenCL backend. The goal is a DNN framework compatible with heterogeneous platforms that achieves competitive performance on the OpenCL tool chain [3]. Owing to the hierarchical design of caffe and the high complexity of DNN algorithms, we use a two-phase strategy. First, we introduce OpenCL porting strategies that guarantee algorithm convergence. Because of key differences between OpenCL and CUDA, the port requires some redesign of the hardware abstraction layers to handle hardware resource allocation and utilization, e.g. CPU-GPU task assignment, memory management, data transfer, and use of computing resources, so as to guarantee both algorithm correctness and the best performance. Second, we analyze OpenCL's performance bottlenecks in the DNN domain and propose several optimization techniques, including a batched data layout that unrolls the convolution data to boost data parallelism, and multiple command queues to boost task parallelism across kernels. These techniques map the problem size better onto the existing BLAS library, improve hardware resource utilization and boost OpenCL runtime efficiency.

We verify OpenCL caffe's offline training and online recognition on both server-end and consumer-end GPUs. Users can run mainstream DNN models directly and obtain the best performance on a specific processor by choosing the optimal batch size for the hardware's properties and the input data size. Experimental results show that the phase-two optimizations achieve a 4.5x speedup without modifying the BLAS library. We evaluate OpenCL caffe on mainstream AMD server-end and consumer-end graphics cards; for example, on the AlexNet model with minibatch size 128, AMD's latest Fiji GPU, the R9 Fury, achieves offline training at 261 images per second and online recognition at 699 images per second. We also provide an apples-to-apples comparison of caffe with the CUDA backend versus caffe with the OpenCL backend. Compared to the highly optimized cuDNN library, which incorporates many low-level optimizations, OpenCL caffe still shows a 2x performance gap; ongoing optimizations aim to close this gap and reach performance competitive with cuDNN.

As an ongoing effort in collaboration with caffe developers, cross-platform portability and a seamless merge with CUDA caffe are the major goals of OpenCL caffe. We analyze the current capabilities and limitations of OpenCL, including differences across specific manufacturers' extensions and versions. We also summarize the key challenges in merging the OpenCL backend with the CUDA backend in the same caffe framework, to the benefit of the machine learning community.
Kelsey Allen, Luke Hewitt, Jiajun Wu and Josh Tenenbaum. Analysis by synthesis for speech recognition
Abstract: Over the first few years of life, most human beings learn to both produce and understand speech. How these processes interact is still mostly a mystery. One hypothesis is that when we perceive speech, we invert the generative process that led to that sound being made. This is sometimes referred to as analysis by synthesis. The hypothesized benefits of this approach include being able to generalize across languages not encountered before, or reproduce words that are unfamiliar to you (like the name of a colleague you have just met). For both children and adults, this capability is crucial for interacting with new people and for being able to learn new vocabulary. However, inference in these generative models is very slow, and therefore generally intractable for real world speech.

On the other hand, neural networks have shown great success for speech recognition when there is a significant amount of training data for a language or speaker. We hypothesize that by combining a more flexible generative model with neural networks, we may be able to improve generalization to novel words and languages. From a cognitive perspective, this also gives an integrated account of speech production and perception by suggesting that perception is simply the process of inverting the production mechanism.

For our generative model, we use the Festival synthesis system [1] which takes as input a set of phonemes and their durations, and outputs a waveform. To model human speech acquisition, we assume a single fixed voice produces the sounds.

In order to train the neural network, we synthesized words from 12 different voices in Festival, and supplemented these with real English speech from the TIMIT dataset. The output categories are English phonemes, while the inputs are the Mel-Frequency Cepstral Coefficients extracted from 25 ms frames of speech. For a given word, there is thus a series of phoneme predictions which are strung together using a Hidden Markov Model to capture linguistic dependencies. The result is a neural network which is capable of giving a sequence of phonemes with durations for a given input word.

We then refine the predictions from the neural network by performing inference in the generative model (under a single speaker). The standard Metropolis-Hastings algorithm is used, with proposals corresponding to phoneme merges, splits, and substitutions. The combined system is then tested on a variety of speech tasks, attempting to mimic the human ability to produce one-shot imitations of unfamiliar words, or of words spoken in other languages, by novel speakers. We find that even without substantial inference, the system produces sounds resembling the input for a non-word English dataset as well as for a Japanese dataset; this breaks down if the neural network is not explicitly trained on synthesized voices. We are now performing perceptual experiments on Mechanical Turk to quantify the improvement in generalization from incorporating the generative model.

[1] Black, Alan, et al. “The Festival speech synthesis system, version 1.4.2.” (2001).
Ronghui Yan, Cheng Wu and Yiming Wang. Exploration and Evaluation Driving Fatigue for High-speed Train: A Parametric SVM Model Based on Multidimensional Visual Cue
Abstract: In recent years, high-speed rail has been growing rapidly around the world, especially in China. High-speed railways need a higher safety level than other public transport because of larger passenger volumes and higher speeds; this requires not only higher standards for hardware facilities but also stricter demands on train drivers. It is well known that train drivers play a key role in rail operation, and their physiological and psychological states are important factors in ensuring driving safety.
Sleepiness and fatigue at work occur frequently in rail transportation due to long operation times, irregular shift schedules, high psychological and cognitive demands, and so on. In order to monitor train drivers' fatigue state, we used eye behavior characteristics as indicators of fatigue.
Sixteen participants from traffic engineering disciplines took part in two train driving simulation tests, conducted in the morning (after a good night's sleep) and in the afternoon (after being deprived of 8 hours of sleep the night before). Eye movement data were collected with a Tobii X2-60 remote eye tracker, and sleepiness was assessed using the Karolinska Sleepiness Scale (KSS) and sleep diaries. Vigilance and attention were measured by detection distance, defined as the track distance between detecting a safety-critical event and passing that event. A t-test was performed to examine differences in sleep time, sleep quality, KSS and detection distance between the two driving tasks.
Raw eye data were separated into segments using a 30 s time window with 50% overlap. After preprocessing, three eye movement features for driver fatigue detection were extracted: pupil diameter, blink time and fixation percentage. A t-test was performed on each feature for each participant to assess significant differences between the sober and fatigued states.
In our work, we classify each subject's segments into the two states using a Support Vector Machine (SVM) classifier. Driver sleepiness affects driver behavior in a subtle, inconsistent manner that can easily be submerged by individual differences, so the fatigue classification model should be driver-specific. We trained one SVM model per driver, for a total of 16 driver-specific SVM models. The Gaussian Radial Basis Function (RBF) kernel was chosen as the kernel function, and grid search with 5-fold cross-validation was applied to select the penalty parameter and the kernel parameter.
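A sketch of this per-driver model selection in scikit-learn. The synthetic samples (columns standing in for pupil diameter, blink time, fixation percentage) and the grid values are illustrative placeholders, not the study's data or settings:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic sober vs. fatigued segments for one hypothetical driver:
# columns are [pupil diameter, blink time, fixation percentage].
sober = rng.normal([3.5, 0.2, 0.6], 0.1, size=(40, 3))
fatigued = rng.normal([3.0, 0.4, 0.4], 0.1, size=(40, 3))
X = np.vstack([sober, fatigued])
y = np.array([0] * 40 + [1] * 40)  # 0 = sober, 1 = fatigued

# RBF-kernel SVM with grid search over the penalty C and kernel width
# gamma, scored by 5-fold cross-validation.
grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
search.fit(X, y)
```

In the study this fit is repeated independently for each of the 16 drivers, yielding one driver-specific model per participant.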
By analyzing self-reported sleepiness scales, detection distance, and eye movement data, we found that the pupil diameter, blink time, and fixation percentage of fatigued drivers were significantly different from those of sober drivers. We also observed clear individual differences among train drivers. The mean accuracy of the driver-specific SVM models was 89.79%, which demonstrates that pupil diameter, blink time, and fixation percentage can be used to detect drivers' fatigue state reliably. These quantitative results can serve as a preliminary basis for designing the human-train interface of high-speed railways.
Yoshua Bengio, Anirudh Goyal and Nan Rosemary Ke. The Variational Walk Back Algorithm
Abstract: A recognized obstacle to training undirected graphical models with latent variables such as Boltzmann machines is that the maximum likelihood training procedure requires sampling from Monte-Carlo Markov chains which may not mix well, in the inner loop of training, for each example. We first propose the idea that it is sufficient to locally carve the energy function everywhere so that its gradient points in the “right” direction (i.e., towards generating the data). Following on previous work on contrastive divergence, denoising autoencoders, generative stochastic networks and unsupervised learning using nonequilibrium dynamics, we propose a variational bound on the marginal log-likelihood of the data which corresponds to a new learning procedure that first walks away from data points by following the model transition operator and then trains that operator to walk backwards for each of these steps, back towards the training example. The variational bound relies on gradually increasing temperature as we walk away from the data, at each step providing a gradient on the parameters to maximize the probability that the transition operator returns to its previous state. Interestingly, this algorithm admits a variant where there is no explicit energy function, i.e., the parameters are used to directly define the transition operator. This also eliminates the explicit need for symmetric weights which previous Boltzmann machine or Hopfield net models require, and which makes these models less biologically plausible.
Tegan Maharaj and Christopher Pal. Fill-in-the-blank QA for large-scale movie dataset understanding
Abstract: Deep convolutional networks excel at image recognition and classification tasks, sometimes exceeding human-level performance. With huge amounts of text data available on the internet and via smartphones, natural language understanding has also advanced dramatically in the recent past. There are many successful and interesting applications combining the two technologies, including image captioning, storytelling, automatic tagging, question answering, etc. In the last few years, as it has become more computationally feasible and research on recurrent neural networks has progressed, these applications have naturally been extended to video. It remains, however, extremely time-consuming and computationally intensive to train models on video, and compared to image data, the amount of labelled data for tasks such as recognition is relatively small, though growing rapidly. The LSMDC (Large-Scale Movie Description Challenge) dataset is one large dataset, released in 2015, with high-quality annotations for scenes from Hollywood movies, transcribed from descriptive video services for the visually impaired. Evaluating natural language descriptions of video with humans can be expensive, and as discovered in the 2015 challenge, it is advantageous for models to describe scenes as vaguely as possible in order to be scored as 'correct'. To address the need for an inexpensive, quantitative evaluation criterion, as well as to provide a more targeted, simple recognition task, we introduce a fill-in-the-blank version of the dataset. Blanks include nouns, verbs, adjectives and adverbs. This dataset forms one track of the 2016 LSMDC Challenge, along with movie description and movie retrieval. This work provides natural language, natural language + image, and natural language + video baselines for comparison, and also presents an analysis of the blanked words and the relative performance of each model on different semantic categories.
With over 300 thousand examples, we hope that this dataset will serve as a useful benchmark for evaluating video models and better understanding where and why they excel and fail.
Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele and Trevor Darrell. Generating Visual Explanations
Abstract: Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image aspects which justify visual predictions. We propose a novel model to generate visual explanations. Our model focuses on the discriminating properties of the visible object, jointly predicts a class label, and provides a natural language explanation for why the predicted label is appropriate for the image. In order to train our explanation system, we propose a novel loss function based on sampling and reinforcement learning. Our results on a fine-grained bird species classification dataset show that our model is able to generate explanations which are not only consistent with an image but also more discriminative than descriptions produced by existing caption methods.

Visual explanations must satisfy two criteria: class specificity and image relevance. Visual descriptions include image-relevant text, but may fail to include class-discriminative information important to a classification decision. Class definitions include class-specific text, but may discuss features which are not present in a specific image instance. In contrast, our explanations are more class-discriminative than image descriptions and more image-relevant than class definitions.

Key to our model is our novel loss function, which encourages explanations to include class-discriminative text. One challenge in designing a loss to optimize for class specificity is that class specificity is a global sentence property: whereas the sentence “This is a black bird with a white nape” is class-specific to a “White-Necked Raven”, words and phrases in the sentence, such as “black” or “white nape”, are less class-discriminative on their own. Our proposed loss enforces that generated sequences fulfill a certain global property, such as category specificity. The loss operates over a sampled sentence, and we backpropagate it via a technique from the reinforcement learning literature. While typical sentence generation losses optimize the alignment between generated and ground-truth sentences, our discriminative loss specifically optimizes for class specificity.
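The reinforcement-learning step can be illustrated with a REINFORCE-style surrogate loss; this is a hedged sketch, and the sentence-level reward used here (a class-specificity score for the sampled sentence) is an assumption about the general technique, not the paper's exact definition:

```python
import numpy as np

def discriminative_loss(sampled_logprobs, class_reward):
    """REINFORCE-style surrogate loss over one sampled sentence.

    `sampled_logprobs` are the per-token log-probabilities of the sampled
    sentence; `class_reward` is a sentence-level class-specificity score
    (e.g., a classifier's probability of the true class given the sentence).
    The gradient of this scalar w.r.t. the model parameters is the
    reward-weighted policy gradient, so minimizing it pushes the model
    toward class-discriminative sentences.
    """
    return -class_reward * float(np.sum(sampled_logprobs))
```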

Our explanation model outperforms all baseline models on a variety of automatic metrics and human evaluations. Qualitatively, our explanation model discusses class relevant features which are not discussed by description models. For example, the “Bronzed Cowbird” has startling red eyes. However, a description model neglects this important feature when outputting a sentence (“This is a bronzed cowbird because this bird is nearly all black with a short pointy bill.”) In contrast, our explanation model mentions the red eyes (“This is a bronzed cowbird because this is a black bird with a red eye and a pointy black beak.”)

Visual explanation is a rich research direction, especially as the field of computer vision continues to employ and improve deep models which are not easily interpretable. Our work is an important step towards explaining deep visual models.

Full text: LA Hendricks, et al. Generating Visual Explanations. ECCV 2016.
Deep learning based generative models for statistical assessment of macromolecular structures detected inside single cells
Elaheh Momeni and Reza Rawassizadeh. Multimodal Prediction of Individual Mood
Abstract: The last two centuries have seen innovative evolution in the field of healthcare. Moods are central to the expression of ideas and feelings, and in turn impact health and attitudes. Continuous individual mood monitoring can help in the early detection and efficient handling of health issues, especially psychological ones. The promise of a "Mood Predictor" using mobile sensing has long been championed; however, several issues limit the usage of these solutions in practice. First, many of these solutions only consider mobile and wearable devices, which are often restricted to a specific configuration designed for a unique study goal. These approaches do not address real-world challenges such as battery limitations, uncertainty and data loss, and the demotivation caused by the manual intervention required of users. Moreover, users' activities and interactions can also be sensed implicitly from their online social interactions. Therefore, exploring different combinations of various sensors from different sources enables the development of the next generation of "Mood Predictor".

Second, users have different behaviours and habits, which change for each individual based on context, time, and environment; therefore, a static training set for creating a mood predictor is neither accurate nor effective.

Third, mood notifications are increasingly used to send alerts and reminders to users to enable learning from user mood and context. However, many notifications are neither useful nor relevant to users' interests and, therefore, are considered disruptive and obtrusive.

We propose a generic sensing module and a novel energy-efficient, on-device mood prediction framework that is based on a semantic abstraction approach to convert sensor data into meaningful information objects, similar to human perception of a behaviour.
The proposed approach uses: (1) a multi-modal sensing model via smartphones, wearables, and social media interactions; (2) online learning, which enables the system to adapt itself to potential changes in users' behaviour over time and tackles the problem of limited training data; (3) a novel feedback management solution that learns users' preferences for receiving notifications and feedback by mining their interactions with mobile phones, wearables, and social media to automatically extract features. Furthermore, our work extends previous approaches for context sensing and prediction on wrist-mounted wearables that perform predictive analytics outside the device; because all processing is performed on-device, our work overcomes the associated privacy issues.

We analyse the energy efficiency of our framework modules through six in-the-wild evaluations, identify the optimal file structure for data access, and demonstrate an increase in prediction accuracy by exploiting social media interactions and our semantic abstraction method. The proposed framework is hardware independent and can serve as a reference model for the implementation of context sensing and prediction on small wearable devices beyond smart watches, such as body-mounted cameras.
Rachita Chhaparia, Deepanshu Gupta and Amitabha Mukerjee. Creating Sense Vectors through Cross-lingual Data
Abstract: In this project we propose a novel approach to the problem of Word-Sense Disambiguation, i.e., the task of automatically identifying the sense of a polysemous word in a given context. A related problem, described below, is the single representation of these polysemous words in the word vector space. In the vector space setting, words with similar semantics are distributed closely. However, within such a setting, we often find that the word space distribution is polluted owing to the different meanings and uses of certain words, called polysemous words, in different contexts. One solution is to create a separate vector for each sense of a polysemous word, called sense vectors.
In this project, we formulate a novel cross-lingual approach to create sense vectors for use in word sense disambiguation. To disambiguate in a target language, we use another, less related source language (one with a different proto-language) based on the following two assumptions:
(1) Word vectors of similar words in two different languages are related by a linear transformation. This is based on the work of Mikolov et al. [1].
(2) A polysemous word in one language translates to a set of distinct, unrelated words in another language.
The transformation matrix (W) that linearly relates word vectors in the two languages can be learned by using a bilingual dictionary D containing the top 5,000 most frequent words in the source language as the training set and performing stochastic gradient descent.
In our project, we use Hindi as the source language and English as the target language, such that given the vector representation of a word in Hindi, W produces a semantically similar vector in the English word vector space.
For every English polysemous word, the vectors of its translated senses in Hindi are mapped back to English which are the corresponding sense vectors for the word in English. For every target word, we calculate the cosine similarities between the context vector and each of its sense vectors created above. The sense whose sense vector has the maximum cosine similarity is chosen as the correct sense.
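The two steps above (learning W over a bilingual dictionary by SGD, then choosing the sense with maximum cosine similarity to the context vector) can be sketched as follows; the learning rate, epoch count, and vector dimensions are illustrative assumptions:

```python
import numpy as np

def learn_mapping(X_src, Y_tgt, lr=0.01, epochs=30, seed=0):
    """Learn W minimizing ||W x_src - y_tgt||^2 over a bilingual dictionary
    by stochastic gradient descent, as in Mikolov et al. [1].

    X_src: (n, d_src) source-language word vectors (dictionary entries).
    Y_tgt: (n, d_tgt) their translations' target-language vectors.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(Y_tgt.shape[1], X_src.shape[1]))
    for _ in range(epochs):
        for i in rng.permutation(len(X_src)):
            err = W @ X_src[i] - Y_tgt[i]       # residual for one pair
            W -= lr * np.outer(err, X_src[i])   # SGD step on squared error
    return W

def disambiguate(context_vec, sense_vecs):
    """Pick the sense whose (mapped) vector is most cosine-similar to the context."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(sense_vecs, key=lambda s: cos(context_vec, sense_vecs[s]))
```

Sense vectors for an English polysemous word would be built by applying W to the Hindi vectors of its distinct translations.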

This approach has shown good results given that it is very simple and extensible. Moreover, it can be applied to any pair of languages and hence can be used to disambiguate in any language.

1. Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. "Exploiting similarities among languages for machine translation." arXiv preprint arXiv:1309.4168 (2013).
Olga Slizovskaia, Emilia Gomez and Gloria Haro. Automatic musical instrument recognition in audiovisual recordings by combining image and audio classification strategies
Abstract: Musical instrument recognition is one of the oldest tasks in music information retrieval. Over time, researchers have achieved significant progress in solo phrase classification, but more complex polyphonic cases remain a challenging problem and a steady source of research. Most musical instrument recognition techniques are based solely on audio information. However, if we consider audiovisual recordings, such as music performance videos, we have an additional data stream at our disposal. Hence, one possible improvement is to include visual information to complement well-known audio-based methods.

In this work, we combine a standard feature-based approach for musical instrument classification in the audio domain with a state-of-the-art convolutional neural network object recognition architecture in the visual domain. First, we explore the quality of two audio-based classifiers (SVM and gradient boosting) on several datasets (RWC Musical Instruments, IRMAS) with between 11 and 50 classes and between 3,544 and 6,705 excerpts. At the same time, we finetune a VGG-16 network, pretrained on the ImageNet dataset, for musical instrument recognition. For the finetuning, we used 12 synsets (classes) totalling about 12,000 images.

Armed with the two best classifiers, we present a case study with a qualitative assessment followed by a quantitative assessment of musical instrument recognition in video recordings of musical performances. To identify musical instruments in a video we employ a single-frame model: we uniformly sample image frames from a video and extract synchronized 3-second audio excerpts from the corresponding audio signal. In the qualitative assessment, we observed better generalization for images than for audio; in contrast, the quantitative assessment demonstrated better audio-based classification. We also observed similar confusions between instruments for both modalities, confirming that the visual (shape) and acoustic (timbre) properties of musical instruments are related to each other.

Our contributions are a detailed evaluation of musical instrument recognition techniques and a distinctive frame-based model for video recordings of musical performances.

Note: This work was presented at the Sound and Music Computing Conference 2016 as a full research paper.
Luiza Sayfullina and Mark van Heeswijk. Fixing random weight initialization as a form of regularization.
Abstract: Recent work in the deep learning community has actively aimed at regularizing complex networks through several techniques, such as drop-out, drop-connect, maxout, and other forms of introducing sparsity. We propose another regularization approach: excluding the middle weight matrix of a DAE or MLP from learning, which can help to model noisy datasets better and to test whether the current complexity of the model is higher or lower than needed.

Let us describe our experimental settings. Our basic denoising auto-encoder (DAE) and MLP models have 2 hidden layers, of sizes 400 and 200. The middle weight matrix W2 is either learned by back-propagation or fixed by random Gaussian initialization with the scaling coefficient proposed by Glorot et al. Interestingly, fixing W2 is equivalent to applying a random projection from one hidden layer to the next, which is known to preserve distances between points well. Each hidden layer uses a ReLU nonlinearity, except the output activation function, which is a sigmoid for the autoencoder and a softmax for the MLP.
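The MLP variant with a fixed middle matrix can be sketched as follows (a minimal NumPy illustration of the architecture described above; the input/output sizes assume MNIST, and the training loop is omitted):

```python
import numpy as np

def glorot(rng, fan_in, fan_out):
    """Glorot/Xavier normal initialization."""
    return rng.normal(scale=np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

class FixedMiddleMLP:
    """784-400-200-10 MLP where W2 stays frozen at its Glorot initialization,
    acting as a random projection between the two hidden layers."""
    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = glorot(rng, 784, 400)   # learned by back-propagation
        self.W2 = glorot(rng, 400, 200)   # fixed: excluded from gradient updates
        self.W3 = glorot(rng, 200, 10)    # learned by back-propagation

    def forward(self, x):
        h1 = np.maximum(0, x @ self.W1)       # ReLU
        h2 = np.maximum(0, h1 @ self.W2)      # ReLU through the fixed projection
        logits = h2 @ self.W3
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)   # softmax output

    def trainable_params(self):
        return [self.W1, self.W3]             # W2 deliberately omitted
```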

We run experiments on the MNIST dataset and its variations, on classification and denoising tasks with classification error and binary cross-entropy loss functions respectively. We observe that fixing W2 clearly outperforms on the classification task with noisy datasets: MNIST background random/images and MNIST rotation background image. On the denoising task, the proposed model outperformed the fully trained 2-layer model on the noise-free datasets: MNIST and MNIST basic. Our intuition is that fixing the weight matrix gives the network an intermediate complexity between a fully learned model with the same number of layers (2 hidden layers) and a model with fewer fully learned layers (1 hidden layer in our case). So fixing a weight matrix can be a way of detecting whether the model should be simpler or, on the other hand, whether additional complexity is needed. As many ways exist to approximate the desired function, fixing the weights to reasonable random values can slightly restrict the space of parameters to search.

For the classification task, possibly due to the presence of noise, the fixed weight matrix serves to model the noise instead of trying to learn it. That is why the models on the noisy datasets, MNIST background images and background random, achieve a lower error. At the same time, higher complexity was favored for the MNIST and MNIST basic datasets, so the fully trained two-layer MLP outperformed on them.

It matters which weight matrix is fixed: fixing the first or the last weight matrix in the 2-hidden-layer network did not provide any improvement. We also tried replacing weight fixing with drop-out on the same layer, but did not find the results promising.

To sum up, fixing the middle weight matrix can serve as a useful form of network regularization and complexity reduction. We hope to investigate this further with deeper network architectures, by fixing the weights at different layers and fixing the matrices in a domain-specific way.
Paula Gonzalez-Navarro, Mohammad Moghadamfalahi, Murat Akcakaya, Melanie Fried-Oken and Deniz Erdogmus. Bayesian Fusion of Feedback Related Potentials, Event Related Potentials, and Language Models for EEG Based Typing Systems
Abstract: RSVP KeyboardTM is a speller design developed to allow people with severe disabilities to produce expressive language. It employs rapid serial visual presentation (RSVP) of letters and detects user intent using Bayesian MAP inference based on event-related potentials (ERPs) fused with a (6-gram) symbol language model (LM). During the visual presentation of letters, RSVP KeyboardTM evaluates acquired ERPs in the EEG and assesses their likelihood of being in response to a target letter. While the approach is promising, the low separability of class-conditional EEG feature distributions makes accurate user intent inference challenging. To increase accuracy, repeated stimuli are often used at the cost of increased time-to-decision, resulting in lower spelling speed. In past work, we developed an active learning strategy based on a monotone submodular objective that allows the optimal selection of subsets of symbols (letters, space, or backspace) in each RSVP sequence to probe the user's brain (which can be considered a noisy oracle due to the uncertain EEG evidence acquired), improving speed without compromising accuracy. In this paper, we propose using feedback-related potential (FRP) evidence to significantly improve accuracy while negligibly impacting speed.
Specifically, we demonstrate that fusing FRP evidence with ERP and n-gram symbol-based language evidence in a Bayesian fashion, by acquiring a single trial's worth of FRP evidence at the end of each sequence, can offer significant speed-accuracy benefits to the user. To acquire said FRP evidence, we append a prospect symbol (e.g., the top candidate in the alphabet according to the current posterior given all evidence until that point in time) to some or all RSVP sequences. This stimulus, as in regular RSVP trials, induces an EEG response that is indicative of that prospect's correctness; depending on the instructions given to the user, this FRP may take the form of an error-related potential (ErrP) indicating that an incorrect prospect symbol was presented. Unlike early attempts that use ErrP or FRP evidence to make hard decisions, such as deleting the last typed letter if ErrP detection is positive, we use Bayesian fusion of ERP and FRP evidence with an LM using probabilistic generative models. The system, based on the current confidence estimated by the updated posterior probability distribution, automatically decides to select a letter to type or proceed with more ERP/FRP evidence collection.
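Assuming conditionally independent evidence sources, the Bayesian fusion step amounts to multiplying the language-model prior over symbols by each evidence likelihood and renormalizing. A minimal sketch (the independence assumption and uniform interfaces are ours, not the paper's exact model):

```python
import numpy as np

def fuse(prior, *likelihoods):
    """Bayesian fusion over candidate symbols.

    `prior` is the LM distribution over the alphabet; each element of
    `likelihoods` is a per-symbol evidence likelihood (e.g., from ERP or
    FRP classifiers). Returns the normalized posterior.
    """
    post = np.asarray(prior, dtype=float)
    for lik in likelihoods:
        post = post * np.asarray(lik, dtype=float)
    return post / post.sum()
```

A letter would then be typed once `post.max()` exceeds a confidence threshold, otherwise another ERP/FRP sequence is collected.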
We present experimental results based on data from 12 healthy participants using RSVP KeyboardTM to complete a copy-phrase task. Three strategies are compared: [NoFRP] uses only ERP/LM fusion; [AlwaysFRP] appends each RSVP sequence with the top candidate in the alphabet according to the posterior after ERP evidence fusion, and the corresponding FRP is then incorporated; [ConfirmFRP] shows the top candidate as a prospect to generate FRP evidence only if its posterior exceeds a threshold. Analysis indicates that AlwaysFRP yields significant speed-accuracy benefits for the user.
Daniela Mazza, Angel Juan and Laura Calvet. Urban Mobile Cloud Computing: using learnheuristics for optimizing wireless communication in Smart Cities
Abstract: A smart city is a dynamic environment characterized by a multitude of smart devices, sensors, and processing nodes that aim to distribute intelligence into the city, interacting with each other, sharing data, executing processes, and competing for the allotment of remote resources.
One of the main challenges in this context is to provide solutions able to jointly optimize the distribution of the resources provided for data transfer (typically wireless heterogeneous networks) and data processing (typically IaaS, PaaS, and SaaS cloud resources) for managing the numerous and complex smart city applications, allowing the citizen community to organize and live as a whole organism.
In order to optimize the resource distribution in such a complex scenario, we proposed a framework named Urban Mobile Cloud Computing (UMCC) [1], where the Radio Access Technology (RAT) resources are assigned to the Smart Mobile Devices (SMDs) in a typical Combinatorial Optimization Problem with Dynamic Inputs (COPDI), i.e., a problem where the inputs (located in the objective function or in the constraint set) are not fixed in advance, but may vary as the solution is partially built according to some heuristic-based iterative process.
This kind of problem can be solved by the combination of metaheuristics with machine learning methods, in particular introducing the concept of learnheuristics, a novel type of hybrid algorithm.
Firstly, a biased randomization algorithm is used to find a sub-optimal solution in a reasonable computation time. The algorithm assigns a different probability of being selected to each potential movement in the sorted list: elements at the top of the list receive a higher probability of being selected than those at the bottom, but potentially every element can be selected. Secondly, the proposed learnheuristics method [2] is applied. Initially, historical data on different system states (e.g., different assignments of SMDs to RATs) and their associated inputs (e.g., SMD demands observed for the corresponding assignments) are employed to generate machine learning predictive models (e.g., regression models, neural network models, etc.). Then, these predictive models are iteratively used during the heuristic-based constructive process to obtain updated estimates of the problem inputs (e.g., SMD demands) as the structure of the solution (e.g., the SMD-to-RAT assignment map) varies.
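A common way to realize this kind of biased randomization is a quasi-geometric selection over the sorted candidate list; the sketch below shows one such scheme, where the skew parameter `beta` is an illustrative assumption, not a value from the paper:

```python
import math
import random

def biased_pick(sorted_moves, beta=0.3, rng=random):
    """Quasi-geometric biased randomization over a sorted candidate list.

    Earlier (better-ranked) positions get a higher selection probability,
    but every element keeps a nonzero chance of being picked, which is
    what allows the constructive heuristic to diversify its solutions.
    """
    u = 1.0 - rng.random()                                  # u in (0, 1]
    idx = int(math.log(u) / math.log(1.0 - beta)) % len(sorted_moves)
    return sorted_moves[idx]
```

Repeatedly calling `biased_pick` while building a solution yields the randomized constructive process the abstract describes.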
The use of the learning mechanism allows the heuristic-based construction process to evolve, taking into account the variations of the inputs due to changes in the solution structure. This approach cannot guarantee optimality, but it is more powerful in terms of the size of the instances it can solve in reasonable computing times.
[1] Mazza, D.; Pagès-Bernaus, A; Tarchi, D.; Juan, A.; Corazza, G.E.: "Supporting Mobile Cloud Computing in Smart Cities via Randomized Algorithms," in IEEE Systems Journal , vol.PP, no.99, pp.1-12 doi: 10.1109/JSYST.2016.2578358.
[2] Calvet, L.; De Armas, J.; Masip, D.; Juan, A. (s-160907): “Learnheuristics: hybridizing metaheuristics with machine learning for optimization with dynamic inputs”, in Open Mathematics. ISSN: 18951074.
Alina Selega, Christel Sirocchi, Ira Iosub, Sander Granneman and Guido Sanguinetti. Robust statistical modeling greatly improves sensitivity of high-throughput RNA structure probing experiments
Abstract: RNA structure plays a key role in regulating many mechanisms crucial for correct cellular functioning, such as RNA stability, transcription, and mRNA translation rates. In order to identify RNA structural regulatory elements, chemical and enzymatic structure probing is routinely used to interrogate RNA structure both in vivo and in vitro [1]. In these structure probing experiments, a chemical agent reacts with the RNA molecule in a structure-dependent way, cleaving or otherwise modifying its flexible parts. These modified positions can then be detected by primer extension analyses, providing valuable structural information that can be used to constrain RNA energy-based structure prediction software and significantly improve prediction accuracy [2, 3].

Coupled with high-throughput sequencing, structure probing allows interrogation of thousands of molecules in a single reaction, holding the potential to revolutionise our understanding of the role of RNA structure in regulation of gene expression. However, despite major technological advances, intrinsic noise and high coverage requirements greatly limit the applicability of these techniques. Existing methods [4, 5, 6] do not provide strategies for correcting biases of the technology and are not sufficiently informed by inter-replicate variability in order to perform justifiable statistical assessments.

We developed a probabilistic modelling pipeline which specifically accounts for biological variability and provides automated empirical strategies to correct coverage- and sequence-dependent biases in the data. Our model supports multiple experimental replicates in both control and treatment conditions and computes empirical p-values for each nucleotide by comparing the utilised measure of variability between conditions. These p-values are then used as observations in a Beta-Uniform mixture hidden Markov model, generating posterior probabilities of modification transcriptome-wide as the model's output. This obviates the need for setting arbitrary thresholds and other post-processing.
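Ignoring the Markov dependence between neighbouring nucleotides, the per-nucleotide posterior under a Beta-Uniform mixture has a simple closed form, sketched below; the mixing weight and Beta shape parameter are illustrative assumptions, and the full model additionally couples sites through HMM transitions:

```python
def modification_posterior(p_values, pi=0.1, a=0.3):
    """Posterior probability of modification under a Beta-Uniform mixture.

    p-values are Uniform(0, 1) for unmodified nucleotides and Beta(a, 1)
    with a < 1 (mass concentrated near 0) for modified ones; `pi` is the
    prior modification probability. The Beta(a, 1) density is a * p**(a-1).
    """
    post = []
    for p in p_values:
        mod = pi * a * p ** (a - 1)   # modified-component density at p
        unmod = (1 - pi) * 1.0        # Uniform(0, 1) density
        post.append(mod / (mod + unmod))
    return post
```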

We demonstrate on two yeast data sets that our method has greatly increased sensitivity, enabling the identification of modified regions on many more transcripts compared with existing pipelines. Our method also provides accurate and confident predictions at much lower coverage levels than those recommended in recent studies [6, 7], which are normally only met for a handful of transcripts in transcriptome-wide experiments. Our results show that statistical modelling greatly extends the scope and potential of transcriptome-wide structure probing experiments.

[1] Kubota et al. "Progress and challenges for chemical probing of RNA structure inside living cells." Nature chemical biology (2015).

[2] Wu et al. "Improved prediction of RNA secondary structure by integrating the free energy model with restraints derived from experimental probing data." Nucleic acids research (2015).

[3] Ouyang et al. "SeqFold: genome-scale reconstruction of RNA secondary structure integrating high-throughput sequencing data." Genome research (2013).

[4] Ding et al. "In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features." Nature (2014).

[5] Kielpinski et al. "Chapter Six - Reproducible Analysis of Sequencing-Based RNA Structure Probing Data with User-Friendly Tools." Methods in enzymology (2015).

[6] Talkish et al. "Mod-seq: high-throughput sequencing for chemical probing of RNA structure." RNA (2014).

[7] Siegfried et al. "RNA motif discovery by SHAPE and mutational profiling (SHAPE-MaP)." Nature methods (2014).
Corina Gurau, Chi Hay Tong and Ingmar Posner. Predicting Robot Performance based on Past Experience
Abstract: Robot perception is a difficult yet fundamental problem, as robots interact directly with the world and any errors can have adverse consequences. While significant effort is being devoted to building highly performant perception systems, the problem of estimating in advance a robot's capabilities on a visual task is often overlooked. Predicting how robustly a robot's perception system operates at any given moment in time, across workspaces and operational conditions, can help avoid undesired outcomes.

The framework we propose estimates the robot's object detection performance at test time by leveraging its past performance at the same physical location. The predictions are made without using any information from the detector, even before running the detection algorithm itself. This work is motivated by our previous observations that perception performance for mobile robots (autonomous vehicles in particular) is environment-dependent: in some places of operation performance is excellent, while in others failures occur more often [1]. We attribute this partially to the appearance of the environment (weather, illumination conditions, location), which can sometimes be hostile to the detection system employed.

This work introduces performance records: a probabilistic framework used to incorporate place-specific performance estimates gathered over time, which allow the robot at test time to estimate the likelihood of making a mistake. We record the success or failure of the detection system as binary observations and update a Beta distribution for a set of locations in a map to keep track of performance. The records we propose improve with time, traversals of the same location and better selection of the past observations to be incorporated.
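A place-specific performance record of this kind reduces to Beta-Bernoulli updating; a minimal sketch, with a uniform Beta(1, 1) prior assumed for illustration:

```python
class PerformanceRecord:
    """Per-location record of detection success, as a Beta posterior."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta   # Beta(1, 1) = uniform prior

    def update(self, success):
        """Incorporate one binary observation at this location."""
        if success:
            self.alpha += 1
        else:
            self.beta += 1

    def p_success(self):
        """Posterior mean probability that detection succeeds here."""
        return self.alpha / (self.alpha + self.beta)
```

One such record per map location, updated on each traversal, gives the posterior that the decision-theoretic action selection then consumes.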

Using the proposed probabilistic framework and Bayesian decision theory, we translate the posterior probability of performance into optimal actions the robot can take along a route. The actions available to the robot include offering autonomy, when it believes that its perception system is functioning well and it can reliably operate autonomously, and denying autonomy, when its perception system is failing and a human operator should take over control. After associating a loss with each event-action pairing, the robot chooses the actions which minimise the expected loss.
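
The decision step can be sketched as follows, with purely illustrative loss values for the event-action pairings:

```python
# Minimal Bayesian decision rule; the loss values below are made up
# for illustration (type II error = offering autonomy while failing).
LOSS = {
    ("ok", "offer_autonomy"): 0.0,
    ("ok", "deny_autonomy"): 1.0,         # needless handover to the operator
    ("failing", "offer_autonomy"): 10.0,  # costly type II error
    ("failing", "deny_autonomy"): 0.0,
}

def choose_action(p_ok):
    """Pick the action minimising expected loss given P(perception OK)."""
    def expected_loss(action):
        return p_ok * LOSS[("ok", action)] + (1 - p_ok) * LOSS[("failing", action)]
    return min(["offer_autonomy", "deny_autonomy"], key=expected_loss)

confident = choose_action(0.95)   # expected losses: 0.5 vs 0.95
uncertain = choose_action(0.5)    # expected losses: 5.0 vs 0.5
```

Raising the type II loss far enough makes the robot deny autonomy whenever there is any appreciable chance of failure, matching the behaviour described below.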

Performance records enable the robot to leverage its past experience in order to decide when and where to drive autonomously. Our experiments show that this reduces the number of perception mistakes the robot makes along a route. Additionally, setting the loss of a type II error (offering autonomy when performing poorly) high enough leads to the robot never operating autonomously when there is a chance of failure. In robotics, this kind of introspection is vital as it can avoid catastrophic consequences involving the robot and its environment. We believe that performance records represent a step towards reliable vision systems operating in the real world.

This work was presented at ISER 2016. Full text is available at http://www.robots.ox.ac.uk/~mobile/Papers/2016ISER_gurau.pdf.

[1] C. Gurau, J. Hawke, C. H. Tong, and I. Posner, “Learning on the job: Improving robot perception through experience,” in Neural Information Processing Systems (NIPS) Workshop on “Autonomously Learning Robots”, Canada, 2014.
Jessa Bekker, Arthur Choi and Guy Van den Broeck. Learning the Structure of Probabilistic SDDs
Abstract: Many domains, such as health care, benefit from machine learning only if a certain degree of accuracy is guaranteed for the predictions. For techniques that model uncertainty, such as Bayesian networks and other graphical models, it is in general infeasible to make predictions with such guarantees. While we can measure the accuracy of the model, inference of even simple queries such as marginals is #P-hard, and therefore the predictions are often approximations with unknown accuracy. The field of tractable learning provides a solution by restricting the learned models to those that allow exact inference, so that the predictions are as accurate as the learned model itself.

The key to tractable learning is the use of a tractable representation for the model. A tractable representation essentially encodes the calculations needed to do inference; the size of this representation is therefore the complexity of inference. By keeping the tractable representation small enough, exact inference will always be feasible. Different types of tractable representation exist, differing in the types of tractable queries, the tractable operations (needed during learning) and compactness. Sentential Decision Diagrams (SDDs) support the widest range of tractable queries and operations but pay for this by being less compact [1]. Probabilistic SDDs (PSDDs) are at least as compact as SDDs while supporting the same tractable queries.

SDDs represent Boolean formulas. An SDD that represents a probabilistic model has weights on its variables and supports efficient weighted model counting to make predictions. Combining two SDDs is also efficient; therefore any query can be answered by compiling it to an SDD and combining it with the model. PSDDs are a variation on SDDs with parameters on the edges instead of the variables. PSDDs are at least as compact as SDDs because any SDD can be represented by a PSDD but not the other way around. They can answer the same queries as SDDs because they too can be combined with an SDD that represents the query. PSDDs are therefore an attractive tractable representation, but until now they have not been used for tractable learning; a parameter learner does exist [2].
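
For intuition, weighted model counting can be written as a brute-force enumeration over worlds; an SDD computes the same quantity in time linear in its size. A toy sketch (formula and weights are illustrative):

```python
from itertools import product

# Brute-force weighted model counting over a tiny Boolean formula.
# This is exponential in the number of variables; a compiled SDD
# evaluates the same sum in time linear in the SDD's size.
def wmc(formula, weights, variables):
    total = 0.0
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        if formula(world):
            w = 1.0
            for v in variables:
                w *= weights[v] if world[v] else (1.0 - weights[v])
            total += w
    return total

# P(A or B) with independent P(A)=0.3, P(B)=0.5: 1 - 0.7*0.5 = 0.65
p = wmc(lambda w: w["A"] or w["B"], {"A": 0.3, "B": 0.5}, ["A", "B"])
```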

In this work, we present the first structure learner for PSDDs. It starts with an initial model and incrementally changes it to improve the accuracy. For the incremental changes, we introduce some new operators that change the distribution without changing the possible worlds. This method can naturally incorporate constraints by using them as the initial model and is therefore ideal to learn in structured spaces or on top of expert knowledge but it can also handle unconstrained cases.

[1] J. Bekker, J. Davis, A. Choi, A. Darwiche, and G. Van den Broeck. Tractable Learning for Complex Probability Queries. In NIPS, 2015.
[2] A. Choi, G. Van den Broeck, and A. Darwiche. Tractable Learning for Structured Probability Spaces: A Case Study in Learning Preference Distributions. In IJCAI, 2015.
Eyrun Eyjolfsdottir, Kristin Branson, Yisong Yue and Pietro Perona. Learning recurrent representations for dynamic behavior modeling
Abstract: We address the problem of labeling sequential data, given an abundance of unlabeled sequences and a small set of labeled actions. In addition to classification, we are interested in models that can predict and simulate motion, providing a deeper understanding of the underlying mechanism than classification alone. We believe that learning the two tasks simultaneously is useful: motion prediction as an auxiliary task will allow us to train classifiers using fewer labels, while labels will help ensure that the motion prediction focuses on actions that may be rare but important. In this paper we propose a semi-supervised approach for the general purpose of studying behavior.

Our model is a recurrent neural network, with long short-term memory, that simultaneously classifies actions and predicts future motion of agents (i.e. insects, animals, or humans). The network incorporates elements from a few related fields: Rather than actions being a function of the recurrent state, as is common practice, our model embeds actions in recurrent state units such that the recurrent function encodes action transition probabilities and motion prediction is a direct function of actions, similar to that of a hidden Markov model. The network takes as input an agent's motion and sensory input at every time step, and outputs the agent's next move according to a policy, which is effectively learnt via imitation learning. Similar to autoencoders, our model has a discriminative path, used to embed high level information, and a generative path used to reconstruct the input domain, in our case filling in the future motion. Each discriminative recurrent cell is fully connected with its corresponding generative cell, allowing higher level states to represent higher level information, similar to the idea of Ladder networks [1].
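
The coupling of action units and motion prediction described above can be sketched in a toy single-step form (all shapes, weight matrices and the softmax readout are illustrative assumptions, not the authors' architecture):

```python
import numpy as np

# Toy sketch: actions live in the recurrent state, and the next motion
# is a direct function of the action distribution, as described above.
rng = np.random.default_rng(0)
n_in, n_act, n_motion = 6, 4, 2
Wx = rng.normal(size=(n_act, n_in)) * 0.1      # input -> action units
Wh = rng.normal(size=(n_act, n_act)) * 0.1     # action transition dynamics
Wm = rng.normal(size=(n_motion, n_act)) * 0.1  # actions -> next motion

def step(x_t, h_prev):
    h_t = np.tanh(Wx @ x_t + Wh @ h_prev)   # action-embedding state
    p_act = np.exp(h_t) / np.exp(h_t).sum() # soft action classification
    motion = Wm @ p_act                     # next move as a function of actions
    return h_t, p_act, motion

h, p_act, motion = step(rng.normal(size=n_in), np.zeros(n_act))
```

Feeding `motion` back in as the next input is the iterative simulation mode described in the results below.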

We test our model on two types of data, fruit fly behavior and online handwriting. Both can be thought of as a type of behavior represented in the form of trajectories, but the two are complementary in two important ways: First, flies behave spontaneously, performing actions of interest sporadically and in response to their environment, while handwritten text is intentional and highly structured. Second, handwriting varies significantly between different writers in terms of size, speed, slant, and proportions, while inter-fly variation is relatively small.

Our results show that 1) motion prediction is a good auxiliary task for action classification, especially when training labels are scarce, 2) simulated trajectories, obtained by iteratively feeding predictions as input to the network, look convincing and can be manipulated by activating discriminative cell units, and 3) the network, when trained unsupervised, learns to represent high level information, such as gender or identity, at higher levels of the network and low level information, such as velocity, at lower levels of the network.

Visualization of our model and results can be found at http://www.vision.caltech.edu/~eeyjolfs/behavior_modeling/

[1] Harri Valpola. From neural PCA to deep unsupervised learning. Adv. in Independent Component Analysis and Learning Machines, pp. 143-171, 2015.
Musfira Jilani, Padraig Corcoran and Michela Bertolotto. Structured and Unstructured Machine Learning for Crowdsourced Spatial Data
Abstract: Mapping is an expensive process, and obtaining timely and accurate spatial information is challenging. In this context, crowdsourced maps such as OpenStreetMap (OSM) have the potential to provide a free and up-to-date representation of our world. However, the proliferation of such maps remains limited due to concerns about their quality. The predominant method for quality assessment of such maps involves comparison with proprietary maps, which in itself is a flawed process. This paper proposes a machine learning based methodology for automatically assessing and improving the quality of crowdsourced maps. Specifically, we have focused on the semantic type quality of streets in the OSM database, where semantic type refers to the class of a street such as motorway, residential road, etc.

Domain knowledge suggests that the semantic type information of streets is contained within the various geometrical and topological features of the street network. For example, motorways are usually long and linear whereas residential streets are generally less linear and shorter in length. Therefore, several topological and geometrical features of streets such as length, linearity, number of dead-ends, number of intersections, node degree, and betweenness centrality were extracted.
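
Two of these features can be computed directly from a street's polyline geometry; a minimal sketch with hypothetical coordinates:

```python
import math

# Two of the geometric features named above, computed on a street
# given as a polyline of (x, y) vertices (illustrative implementation).
def street_length(pts):
    return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))

def linearity(pts):
    # Ratio of endpoint distance to path length: 1.0 for a straight street,
    # lower for winding residential streets.
    return math.dist(pts[0], pts[-1]) / street_length(pts)

straight = [(0, 0), (1, 0), (2, 0)]   # motorway-like segment
bent = [(0, 0), (1, 1), (2, 0)]       # residential-like segment
```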

Preliminary experiments involved developing classical machine learning models such as random forests, SVMs, and neural networks. However, a street network is a structured input: it consists of several streets, where not only the streets themselves contain information such as geometry, but the way in which the streets are connected to each other is also important. Therefore, structured learning of the street types in a street network was performed.

Specifically, the Conditional Random Field (CRF) framework was exploited to perform the structured learning. The CRF allowed us to combine the prior knowledge available to us - the crowdsourced semantics, the geometrical and topological features of individual streets, and the contextual (structural) relationships between various streets - into a single unified model. A max-margin approach was used for learning the model parameters, and a fusion moves approach was used for inferring the street labellings.
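
On a toy street graph, the MAP labelling of such a pairwise model can be found by brute force (the paper uses fusion moves for inference; the unary and pairwise scores below are made up for illustration):

```python
from itertools import product

# Tiny brute-force MAP inference for a pairwise model over three
# connected streets and two labels (illustrative scores only).
LABELS = ["motorway", "residential"]
unary = {  # per-street score for each label, e.g. from geometric features
    0: {"motorway": 2.0, "residential": 0.5},
    1: {"motorway": 1.5, "residential": 1.0},
    2: {"motorway": 0.2, "residential": 2.0},
}
edges = [(0, 1), (1, 2)]  # street connectivity

def pairwise(a, b):
    return 0.8 if a == b else 0.0  # connected streets favour equal types

def map_labelling():
    best, best_score = None, float("-inf")
    for labels in product(LABELS, repeat=3):
        score = sum(unary[i][labels[i]] for i in range(3))
        score += sum(pairwise(labels[i], labels[j]) for i, j in edges)
        if score > best_score:
            best, best_score = labels, score
    return best

labelling = map_labelling()
```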

The developed models were trained and tested on two non-overlapping regions from the OSM London database, covering 19 popular semantic types of streets used in the OSM database. An overall classification accuracy of 55.95% was obtained using the unstructured learning model (random forest). This accuracy increased to 84.75% when the structured learning framework was used. The structured learning framework naturally outperforms the unstructured one, as it exploits the inherent structure in street networks. To the best of our knowledge, this is the first time that a structured learning framework has been used in the context of crowdsourced spatial data.

Future work involves extending the models developed in this paper (for streets) to other map objects such as buildings, Points of Interests (PoIs), etc.

Julia Olkhovskaya and Vladimir V’yugin. Fixed-Share for specialized experts
Abstract: Many problems require predicting the elements of a sequence online. We consider the sequential prediction of arbitrary sequences based on expert advice. Each expert must predict the next element of an unknown sequence given some knowledge about previous elements. At each round of a repeated game of prediction, experts output forecasts, which are combined by an aggregation rule. The true outcome is then revealed, and losses, which correspond to prediction errors, are suffered by the aggregation rule and the experts.

Furthermore, in applications, it is often natural for experts to abstain from making predictions on some instances. For instance, in the case of the prediction of electricity consumption, experts can be specialized to winter or to summer, to working days or to public holidays.

Experts make their predictions available to the forecaster before the next outcome is revealed. The forecaster can then make his own prediction depend on the experts’ predictions. The difference between the accumulated loss of the forecaster and the loss of the best expert is called regret. It measures how much the forecaster regrets, in hindsight, not following the advice of this particular expert [1]. In the case of sleeping experts, theoretical guarantees apply only to those steps on which the expert was not sleeping.

We propose a modification of the well-known Fixed-Share algorithm to solve the problem of finding the best combination of expert solutions in the case of sleeping experts [2]. For this modification, we prove equivalence with an exponential weighting algorithm in which experts are combinations of base experts, and in particular we give the definition of the original weights of the combined expert. The loss of an expert while it sleeps is assumed to be the mix loss.
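
A sketch of one round of such an update, under the simplifying assumption that a sleeping expert is charged the forecaster's mix loss, so its weight moves with the aggregate on rounds where it abstains (illustrative parameters, not the paper's exact algorithm):

```python
import math

# One round of a Fixed-Share style update with abstaining experts.
def fixed_share_round(weights, losses, awake, eta=1.0, alpha=0.1):
    # Mix loss of the forecaster, computed over the awake experts only.
    awake_mass = sum(weights[i] for i in awake)
    mix = -math.log(sum(weights[i] / awake_mass * math.exp(-eta * losses[i])
                        for i in awake)) / eta
    # Exponential-weights update; sleepers are charged the mix loss.
    new = [w * math.exp(-eta * (losses[i] if i in awake else mix))
           for i, w in enumerate(weights)]
    z = sum(new)
    new = [w / z for w in new]
    # Fixed-Share step: redistribute a fraction alpha uniformly,
    # which allows tracking a changing best expert.
    n = len(new)
    return [(1 - alpha) * w + alpha / n for w in new]

w = fixed_share_round([0.5, 0.5], losses=[0.0, 1.0], awake={0, 1})
```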

In [3], the sets of sleeping experts at each step were specified before running the algorithm. In our work we consider a more general formulation, in which the sets of sleeping experts are not fixed before the start of the algorithm. Instead, they are determined in real time. The advantage of this approach is that new experts can appear during the operation of the algorithm, and the predictions of some experts may not be available at time t, which often happens in practice. The final result is a novel algorithm that aggregates experts which can abstain from making predictions at any moment, together with a regret bound for this algorithm.

Note: This work is presented at ITaS 2016 as a paper.

1. N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
2. M. Herbster and M. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.
3. M. Devaine, P. Gaillard, Y. Goude, and G. Stoltz. Forecasting electricity consumption by aggregating specialized experts. Machine Learning, 90(2):231–260, 2013.
Evangelia Christakopoulou and George Karypis. Local Item-Item Models For Top-N Recommendation
Abstract: Several algorithms for the top-N recommendation problem have been developed, including approaches that use latent-space models and approaches that rely on neighborhoods. The latent-space methods factorize the user-item matrix into lower-rank user factor and item factor matrices, which represent both the users and the items in a common latent space. The neighborhood-based methods (user-based or item-based) identify similar users or items. The latent-space methods have been shown to be superior for the rating prediction problem, whereas the neighborhood methods perform better for the top-N recommendation problem. Among them, the item-based methods, which include item k-NN and Sparse LInear Methods (SLIM), have been shown to outperform the user-based schemes for the top-N recommendation task.

However, item-based methods have the drawback of estimating only a single model for all users. In many cases, there are differences in users’ behavior, which cannot be captured by a single model. For example, there could be a pair of items that are extremely similar for a specific user subset, while they have low similarity for another user subset. By using a global model, the similarity between these items will tend to be towards some average value; thus, losing the high correlation of the pair for the first user subset.

In this paper we present a top-N recommendation method that extends the SLIM model in order to capture the differences in the preferences between different user subsets. This work is based on the intuition that not all users behave in the same way – instead there exist subsets of like-minded users. By using different item-item models for these user subsets, we can capture differences in their preferences and this can lead to improved performance for top-N recommendations. Our method, which we call GLSLIM (Global and Local SLIM), combines global and local SLIM models in a personalized way and automatically identifies the appropriate user subsets. This is done by solving a joint optimization problem that estimates the different item-item models (global and local), their user-specific combination, and the assignment of the users to these models. Our experimental evaluation shows that GLSLIM improves upon the standard SLIM model and outperforms competing top-N recommendation methods, reaching up to 17% improvement in recommendation quality.
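
The personalized combination can be sketched as a per-user mixture of a global and a subset-specific item-item model (toy dense matrices and a made-up mixing weight; the real models are learned sparse SLIM matrices):

```python
import numpy as np

# Illustrative GLSLIM-style scoring: a per-user weight g_u mixes a global
# item-item model with the local model of the user's subset.
S_global = np.array([[0.0, 0.8],
                     [0.8, 0.0]])
S_local = {0: np.array([[0.0, 0.1],
                        [0.1, 0.0]])}   # item-item model of user subset 0

def scores(r_u, g_u, subset):
    # r_u: binary row of items the user has consumed.
    # Recommendation scores are a personalized blend of both models.
    return g_u * (r_u @ S_global) + (1 - g_u) * (r_u @ S_local[subset])

r = np.array([1.0, 0.0])                # user has consumed item 0
s = scores(r, g_u=0.9, subset=0)        # this user leans on the global model
```

In GLSLIM the weight g_u, the subset assignment, and both item-item models are all estimated jointly; here they are fixed by hand for illustration.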

This work has been accepted at RecSys 2016 (The ACM Conference on Recommender Systems).
Judy Hoffman, Eric Tzeng, Kate Saenko and Trevor Darrell. Domain Confusion for Unsupervised Adaptation of Deep Visual Recognition Models
Abstract: Consider a group of robots trained by a developer to recognize thousands of common objects using standard image databases, then shipped to households around the country. Each robot will encounter its own unique new environment and consequently its own new bias in object appearances. This shift in appearance bias between a training (source) environment or domain and the deployment (target) domain causes real-world performance to suffer.

It is clear that, given enough extra supervised data from the new target environment, the original performance could be recovered. However, state-of-the-art recognition algorithms rely on high capacity convolutional neural network (CNN) models that require millions of supervised images for initial training. Even the standard approach for adapting deep models, fine-tuning, may require hundreds or thousands of labeled examples for each object category that needs to be adapted.

It is unreasonable to assume that the robot consumer will be able or even willing to provide the robot with the in-situ annotations required to refine the recognition model using standard supervised techniques. Instead, it is much more realistic that the robot will survey its new environment autonomously, effectively collecting a set of unlabeled visual examples. Therefore, we propose an unsupervised adaptation algorithm for deep visual models. In particular, we introduce a domain confusion loss, which optimizes for a domain-invariant representation while simultaneously maintaining recognition performance on the labeled source domain.

Our algorithm is inspired by the classic illustration of dataset bias in computer vision by the “name the dataset” game of Torralba and Efros (CVPR 2011). This work trained a classifier to predict which dataset an image originates from, thereby showing that visual datasets are biased samples of the visual world. Indeed, this turns out to be formally connected to measures of domain discrepancy (i.e. a-distance and max mean discrepancy). We show that our technique enables adaptation by directly updating the visual representation parameters so as to minimize the distance between the empirical distributions of the two domains, which in turn minimizes the error of the source domain recognition model when applied to the target domain images.
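
One common way to instantiate a domain confusion objective is the cross-entropy between the domain classifier's softmax output and a uniform distribution over domains; a minimal sketch (the exact form used in the paper is an assumption here):

```python
import numpy as np

# Cross-entropy between the domain classifier's softmax output and a
# uniform target over domains. It is minimised when the representation
# makes the domains indistinguishable to the classifier.
def domain_confusion_loss(logits):
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    n = len(p)
    return -np.sum(np.log(p)) / n       # uniform target: weight 1/n per domain

balanced = domain_confusion_loss(np.array([0.0, 0.0]))    # classifier confused
confident = domain_confusion_loss(np.array([5.0, -5.0]))  # domains separable
```

Minimising this loss with respect to the representation, while the domain classifier is trained adversarially to tell the domains apart, drives the two empirical distributions together.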

We empirically evaluate our algorithm on two standard visual adaptation datasets and demonstrate successful adaptation performance for the object classification task. For full experiments see [1]. Our follow-up work explores adaptation with domain confusion along with paired instance constraints for simulated to real pose prediction tasks for robotic vision [2]. Finally, we have ongoing work extending our deep architecture for unsupervised adaptation of semantic segmentation models with application to self-driving domains.

[1] Eric Tzeng*, Judy Hoffman*, Trevor Darrell and Kate Saenko. “Simultaneous Deep Transfer Across Domains and Tasks”, ICCV 2015. *Equal Contribution.

[2] Eric Tzeng, Coline Devin, Judy Hoffman, Chelsea Finn, Xingchao Peng, Sergey Levine, Kate Saenko, Trevor Darrell. “Towards Adapting Deep Visuomotor Representations from Simulated to Real Environments”, http://arxiv.org/abs/1511.07111, 2016.
Samaneh Khoshrou and Jaime S. Cardoso. Towards Never-Ending Learning from Evolving Video Streams
Abstract: With the advent of video surveillance networks, making sense of an ever-increasing amount of visual data is a growing desire and need in many applications. The automatic (re-)identification of individuals across camera views, known as person re-identification (ReID), has been one of the most challenging problems in the field of video analysis. Current algorithms assume a one-time training session in which labelled observations of all the classes to be observed are available. Furthermore, the visual streams are recorded during a limited time frame. These assumptions are a good approximation to some real-life problems; however, in many applications some of them are violated, rendering the conventional algorithms sub-optimal or impractical.

In this work, we look at the problem of human re-identification from a time-series mining perspective where visual streams are endlessly generated and initially we only have a vague knowledge of what concepts can be observed during the surveillance period. We formalize a learning concept suitable for multi-camera video surveillance and propose a learning methodology adapted to that new paradigm.

We build a framework that resorts to evolving approaches, specifically ensembles, to track the environment online and adapt the knowledge to accommodate environment changes accordingly. The framework is also designed to exploit active learning strategies in order to interact wisely with oracles, requesting assistance on observations that are more ambiguous to classify.

Additionally, a spatial-temporal fusion approach is employed to accurately combine information from Region of Interest (RoI) batches captured in a multi-camera surveillance scenario.

Extensive experiments demonstrate the advantages of the proposed framework in terms of accuracy and complexity over several baseline approaches on multiple datasets.
Jing Yuan, Andreas Lommatzsch and Sahin Albayrak. Contextual Factors Involvement for Score Modeling under LambdaRank in Recommender Systems
Abstract: For most recommender systems, the ultimate task is to generate a ranked item list for a user under specific context environment. The basis for sorting such an item list is the score yielded by the model regarding features of user, item and context respectively. Loss/utility definition on the whole list and score modeling are then the two main components for learning to rank (L2R) in recommender systems.

L2R is a hot research area in the Information Retrieval (IR) community. In order to apply L2R in recommender scenarios, we use the user-context combination to replace the explicit query in IR. Furthermore, LambdaRank, a powerful L2R framework that combines gradients on a pair-wise loss with specific evaluation metrics as utility functions for learning, is chosen to wrap the scoring models.

In this work, we extend the SVDFeature framework [1], which won the 2012 KDD Cup Track 1 [2] and implements LambdaRank. Our scoring model is developed within this framework and tailored to the consideration of contextual factors. Thanks to the natural factorization effect of LambdaRank, we can separate the calculation of the weight-update direction into the partial derivative of the loss with respect to the score and the partial derivative of the score with respect to the weights [3]. Thus more attention can be paid to the score modeling, while the ranking loss no longer needs special care. As for score modeling, aside from users and items, contextual factors (e.g. time, location, company of friends) are also influential auxiliary information a recommender needs to consider. Two modern score modeling strategies, Tensor Factorization (TF) and Gradient Boosted Regression Trees (GBRT), are chosen to involve contextual factors in the score modeling. Compared to the complexity of learning tree structures in the function space of GBRT, TF has the advantage of conveniently calculating derivatives following the matrix factorization approach. Meanwhile, features belonging to these contextual factors can come in different formats, either as discrete categorical values or as continuous values. From this point of view, TF is more bound to discrete input data, yet GBRT retains its strength of flexibly splitting on feature values.
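
The chain-rule separation can be sketched with a toy matrix-factorization score (the factor values and the lambda gradient below are illustrative, not learned):

```python
import numpy as np

# The ranking loss supplies a gradient "lambda" with respect to each
# item's score; the scoring model only needs the partial derivative of
# the score with respect to its own weights. Toy score: s = p_u . q_i.
p_u = np.array([0.1, 0.3])   # user factors
q_i = np.array([0.2, -0.4])  # item factors
score = p_u @ q_i

lam = -0.5                   # dLoss/dscore, from the LambdaRank pair terms
grad_p = lam * q_i           # chain rule: dLoss/dp_u = (dLoss/ds) * (ds/dp_u)
grad_q = lam * p_u           # dLoss/dq_i = (dLoss/ds) * (ds/dq_i)
lr = 0.1
p_new = p_u - lr * grad_p    # gradient step on the user factors
```

Swapping in a GBRT scorer only changes the ds/dw part (fitting a tree to the lambdas); the ranking-loss side stays untouched, which is the separation exploited here.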

There have already been successful applications of TF and of LambdaMART (additive trees under LambdaRank) in recommenders. However, to the best of our knowledge, this is the first time that a tensor-based approach and additive trees are implemented in the same LambdaRank framework.

The experiments are conducted on datasets (TV1 and TV2) collected by two IP-television providers in Europe, with 200,000 and 600,000 television subscribers respectively. For a fair comparison of the scoring models, we apply the same contextual factors, “time of day” and “day of week”, for each algorithm. The experiments report the Mean Average Precision (MAP) obtained when applying TF and GBRT to user-context query units in LambdaRank. The work also aims to inspire other model solutions within the L2R framework to cope with tasks in recommender systems.
Sabrina Rossberger, Carsten Haubold and Heike Boehm. Automated segmentation, tracking and quantitative image analyses of collective cellular migration
Abstract: Collective migration describes the coordinated movement of larger groups of individuals and can be observed in various different systems. Popular examples for collective migration can be found in schools of fish, malaria parasite movement or skin cells during wound healing. Automated and quantitative image analysis tools are essential to reveal fundamental principles and correlations underlying the coordinated movement.

In our project, we investigate the temporal motion of cells in 2D- and 3D-collectives to study highly directed interactions and unique motility features on single cell level. To extract unbiased information describing these highly complex systems we establish a custom designed image analysis pipeline, which covers the automated segmentation, tracking and analysis of large time-resolved in vitro fluorescence microscopy image sequences. The pipeline allows characterizing the highly dynamic behavior of single cells within a very complex system.

Image segmentation is conducted using the open source software tool ‘ilastik’, based on supervised machine learning procedures [1]. Here, a pixel classifier using a Random Forest is trained by providing sparse data annotations. The generated probability map is transformed into a final segmentation, which is used for the subsequent object (cell) tracking. Trajectories of cells are reconstructed by finding the most probable configuration of a tracking-by-assignment model as in ‘ilastik’ [2].
Afterwards, cells are analyzed regarding well-defined biophysical features and quantities (e.g. number of cells, speed and characteristic trajectories) by applying a custom-developed and user-friendly bioinformatics tool. Furthermore, advanced characteristics such as nearest-neighbor behavior and the importance of a cell's relative position within the collective can be studied with machine learning tools. Spatial nearest-neighbor relations can be modeled as graphs to find (affine) transformations between consecutive images in order to reveal correlated behavior. Moreover, cells can be assigned to characteristic cell accumulations by applying clustering algorithms.

The goal is to determine the critical parameters necessary for the appearance of collective motion and maintaining the integrity of the collective. This will improve our understanding of the underlying mechanism and dynamics of collective cell behavior, which will give new insights into e.g. the complex healing process of skin wounds or the mechanism underlying malaria parasite movement for revealing potential weak spots to target in drug development.

[1] C. Sommer, C. Strähle, U. Köthe, F. A. Hamprecht: ilastik: Interactive Learning and Segmentation Toolkit. Eighth IEEE International Symposium on Biomedical Imaging (ISBI). Proceedings, (2011), 230-233
[2] M. Schiegg, P. Hanslovsky, B. X. Kausler, L. Hufnagel, F. A. Hamprecht: Conservation Tracking. ICCV 2013. Proceedings, (2013), 2928-2935
Jovana Mitrovic, Dino Sejdinovic and Yee Whye Teh. Disentangling the Factors of Variation at Initialization In Neural Networks
Abstract: While deep learning methods have achieved state-of-the-art results on many problems in computer vision, natural language processing and speech recognition, training them end-to-end still remains an unsolved problem. In particular, the selection of hyperparameters related to optimization, e.g. learning rate, as well as the initial values of the network weights critically influence the success of the training procedure and thus the generalization performance of the network. In this paper, we introduce a novel data-driven weight initialization scheme. Contrary to most initialization methods, our approach is not motivated by considerations about the numerical stability of the training process, but by incorporating the structural properties of the data into the initial weight values. The main idea of our method is to aid the learning process by initializing the network weights such that the initial hidden representations accurately capture the data manifold and disentangle the underlying factors of variation. In particular, we initialize the weights to be aligned with subspaces spanned by subsets of the data. This ensures that the intermediate hidden representations preserve the structure of the data manifold as they represent inner products between the selected subspaces and the data. We present two approaches for the selection of the needed data subsets. The first method is based on the idea of projecting onto a random data subset drawn according to the available labels, while the second method uses Fisher discriminant analysis to identify the appropriate subspaces. As a result of this construction approach, the units within a hidden layer can be identified as feature detectors and, depending on the chosen subsets, groups of neurons can act as class detectors. Experimental results show that networks initialized with our approach are able to capture the data manifold fairly accurately even without any training. Furthermore, we tested our initialization approach against standard initialization methods on MNIST and CIFAR-10 and achieved superior or highly competitive results. To the best of our knowledge, this is the first weight initialization method for neural networks that is supervised and disentangles the underlying factors of variation in the data prior to training.
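
The core idea can be sketched as follows, assuming (for illustration) that each hidden unit's weight vector is a normalised training example, so the first layer computes inner products with data-spanned directions rather than with random noise:

```python
import numpy as np

# Toy sketch of data-driven initialization: hidden weights are drawn
# from the data itself, so the initial hidden representation reflects
# the structure of the data manifold (random data here, no labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # training data, one example per row
idx = rng.choice(len(X), size=16, replace=False)
W0 = X[idx] / np.linalg.norm(X[idx], axis=1, keepdims=True)  # 16 hidden units

# Initial hidden representation: inner products with the selected
# data directions, before any gradient training has taken place.
h = np.maximum(0.0, X @ W0.T)
```

In the paper the subsets are chosen by labels or by Fisher discriminant analysis rather than uniformly at random; the uniform choice here is purely for illustration.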
Melanie Fernandez Pradier and Fernando Perez-Cruz. Case-Control Indian Buffet Process for Biomarker Discovery in Clinical Trials
Abstract: Biomarker discovery in clinical trials is of the utmost interest for understanding both disease mechanisms and drug effects on patients. It is particularly useful for prediction of disease progression, prescription of appropriate drugs, or even for identifying potential targets in the development of new drugs tailored to patient characteristics. Ultimately, we want to identify two types of biomarkers. Prognostic biomarkers are clinical or biological characteristics that provide information on the likely outcome of the disease for a given patient, regardless of any treatment. In contrast, predictive biomarkers provide information on the likely benefit from treatment, e.g., in terms of tumor shrinkage or survival.

Traditional approaches for screening potentially interesting biomarkers typically suffer from at least one of the following: (i) population heterogeneity: the fact that each individual has unique characteristics - demographics, environment or biological aspects - resulting in different disease progressions and drug responses; (ii) correlation among biomarkers, which might hinder their detection if they are screened individually; and (iii) difficulty in isolating drug effects from natural responses, especially in small sample-size scenarios. As a consequence, it is very challenging to personalize treatments or develop effective drugs. Indeed, most major drugs are known to be effective in only 25 to 60 percent of patients, and more than 2 million cases of adverse drug reactions occur annually in the United States, including 100,000 deaths.

This work presents a Bayesian non-parametric approach for biomarker discovery and sub-population characterisation in clinical trials. Our method identifies both prognostic and predictive variables, globally and specifically for each sub-population, and quantifies the direction of action, effect size and statistical significance of each biomarker. It handles noise, missing information and heterogeneous data naturally. The number of sub-populations and the sets of relevant biomarkers are learned automatically. Our model extends the general Indian Buffet Process (IBP), a Bayesian non-parametric prior for infinite latent feature models that can handle mixed continuous and discrete data. The IBP has already been used in biological applications involving gene expression data or genetic tumor variation. To the best of our knowledge, this is the first time it has been applied to clinical trials.

In order to deal with small sample-size settings, we share information among patients in a structured way, which makes the inferred features easier to interpret. We define two types of features: global features that can be active for any patient, and drug-specific features that can only activate for patients who have taken the drug. Thus, there is a clear separation between drug effects and natural prognostic factors. Our methodology is robust, as it combines bootstrapping techniques with soft feature assignments, together with assessment of statistical significance and control for multiple-hypothesis testing. We demonstrate the usefulness of our novel approach on a real database, namely a randomized phase II case-control study of a cutting-edge immunotherapy treatment against liver cancer. Not only does our method find well-known relevant biomarkers, it also discovers new ones that could not be found with previous methods, at both the global and sub-population levels.
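As background, the IBP prior underlying this model can be simulated with the classic "buffet" construction: each customer (patient) takes existing dishes (features) in proportion to their popularity and tries a Poisson number of new ones. A minimal numpy sketch, purely illustrative and not the authors' implementation (the function name and arguments are invented here):

```python
import numpy as np

def sample_ibp(num_customers, alpha, rng=None):
    """Draw a binary feature matrix Z from an IBP(alpha) prior."""
    rng = np.random.default_rng(rng)
    dishes = []  # number of customers who have taken each dish so far
    rows = []
    for n in range(1, num_customers + 1):
        # take existing dish k with probability m_k / n
        row = [rng.random() < m / n for m in dishes]
        for k, taken in enumerate(row):
            if taken:
                dishes[k] += 1
        # try Poisson(alpha / n) brand-new dishes
        new = rng.poisson(alpha / n)
        dishes.extend([1] * new)
        rows.append(row + [True] * new)
    Z = np.zeros((num_customers, len(dishes)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = sample_ibp(10, alpha=2.0, rng=0)
```

The number of columns (latent features) grows with the data rather than being fixed in advance, which is what lets the model learn the number of sub-populations and biomarker sets automatically.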
Li-Fang Cheng, Gregory Darnell, Corey Chivers, Michael Draugelis, Kai Li and Barbara Engelhardt. MedGP: A Sparse Gaussian Process Framework for Multi-Covariate Medical Time-Series Analysis
Abstract: In real-time monitoring of hospital patients, high-quality inference of patient status, incorporating observations across patient demographics, clinical covariates, and lab tests, is essential to enable successful medical interventions and to improve patient outcomes. The first challenge is to smooth and predict the trends of clinical covariates with high missing rates in an online way. Recent studies have shown that with more accurate smoothing and prediction of missing data, it is possible to extract disease subtypes or increase the accuracy of clinical event prediction (e.g., onset of sepsis). In this work, we propose to enable personalized, real-time prediction of covariates of interest for new patients by learning relationships among clinical covariates from a large set of reference patients. We develop a statistical framework based on Gaussian process (GP) regression (MedGP) that estimates the temporal correlations within and between clinical covariates from sparse, heterogeneous medical time-series data and predicts the values of vital signs and lab results in a patient-specific way.

MedGP is effective on these sparse and noisy data because of three essential components. First, we developed a mixture of flexible GP kernels, along with a cross-covariate Gram matrix, to encode covariate-specific temporal correlations that may be shared across multiple covariate time series, such as smooth long-term trends or 24-hour periodicity. Next, we designed a sparse prior to regularize the covariance structure and prevent overfitting. We used reference patients to estimate an empirical prior distribution for the GP kernels, and we estimated a population-level prior kernel using nonparametric density estimates. Finally, when applying the trained model to a new patient, we updated the posterior distributions of the GP kernels on the fly based on new observations to personalize predictions to each patient. To keep computation tractable, we derived a stochastic variational inference (SVI) scheme that enables robust predictions in an online, hospital setting.

We evaluated MedGP using a subset of data from the University of Pennsylvania Hospital. The subset includes medical time series from over 6,000 patients in three disease subgroups; in total, over four million covariate observations were tested. We ran our method with 24 clinical covariates on the three patient subgroups separately, and we compared prediction errors with baseline methods that do not use cross-covariate relationships. Our results show that MedGP is able to capture interpretable relationships among clinical covariates. For instance, we found that one of the base kernels is a periodic kernel with a period close to 24 hours, indicating a circadian rhythm for those clinical covariates. We also found that MedGP produces significantly smaller prediction errors when used for predicting covariates that are well correlated with other covariates, which is not possible with the univariate baseline methods. In summary, the proposed framework enables tractable, fine-grained prediction of temporal trends in complex and heterogeneous medical time-series data by exploiting observable dependencies among clinical covariates.
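A toy illustration of the additive kernel structure described above: a GP posterior mean under the sum of a smooth (RBF) kernel and a 24-hour periodic kernel, applied to an irregularly sampled synthetic vital sign. This is a sketch of the general idea only; MedGP's actual kernel mixture, sparse priors and SVI scheme are far richer:

```python
import numpy as np

def rbf(t1, t2, ls=10.0):
    # smooth long-term trend component
    d = t1[:, None] - t2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def periodic(t1, t2, period=24.0, ls=1.0):
    # circadian (24-hour) component
    d = np.abs(t1[:, None] - t2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ls ** 2)

def gp_posterior_mean(t_train, y_train, t_test, noise=0.1):
    k = lambda a, b: rbf(a, b) + periodic(a, b)   # additive kernel
    K = k(t_train, t_train) + noise * np.eye(len(t_train))
    return k(t_test, t_train) @ np.linalg.solve(K, y_train)

# irregularly sampled synthetic vital sign with a 24-hour rhythm
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 72.0, 40))
y = np.sin(2 * np.pi * t / 24.0) + 0.1 * rng.normal(size=40)
mu = gp_posterior_mean(t, y, np.linspace(0.0, 72.0, 100))
```

The posterior mean interpolates the sparse observations while the periodic component carries the 24-hour structure across gaps in the data.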
Genna Gliner, Yoson Park, Christopher Brown and Barbara Engelhardt. Identifying regulatory variants in large-scale genetic expression studies with zero-inflated Poisson linear models
Abstract: Studying the relationship between genetic regulatory elements and phenotypic expression in high-dimensional genetic data has proven to be a powerful method to investigate the regulatory mechanisms of complex human traits. In high-throughput genetic data, there are many sources of variation that confound the identification of the true biological signal.

In this work we propose a statistical linear regression framework that addresses the confounding factors inherent in the data, such as artifactual zeros in the response variable and correlation between the predictors, and identifies potential regulatory elements. We use a Bayesian approach and apply a zero-inflated Poisson linear model (ZIPLM) to identify regulatory genetic variants driving differential expression levels, using genotype data as predictors and RNA-seq data as the response. The ZIPLM is governed by two zero-generating processes that distinguish structural zeros caused by confounding factors from sampling zeros that arise from biological drivers of variation. We parameterize the process governing the sampling zeros with a Poisson generalized linear regression model in which potential regulatory variants serve as the predictors. We link the predictors in the Poisson model through a Gaussian process to account for correlation between the predictors due to factors such as population structure, and we identify regulatory elements based on the estimated regression coefficients.

The model is fit using a variational Bayes routine. We assess our method on the problem of detecting the true biological signal regulating differential expression of protein-coding genetic variants across alleles, a phenomenon known as allele-specific expression (ASE). We evaluate our method through simulation studies and illustrate its advantages over conventional allele-specific association studies. We present the results of our work applied to the Genotype-Tissue Expression (GTEx) v6 project, containing samples from 449 individuals, where we identify patterns of allele-specific expression across the genome in 28 tissues.
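The two zero-generating processes at the heart of the ZIPLM can be written compactly as a mixture likelihood. A sketch of the zero-inflated Poisson log-likelihood, with `pi` the structural-zero probability and `lam` the Poisson rate (names chosen here for illustration; the paper's model ties `lam` to genotype predictors through a GLM):

```python
import numpy as np
from scipy.special import gammaln

def zip_loglik(y, lam, pi):
    """Log-likelihood of i.i.d. zero-inflated Poisson observations:
    with probability pi a structural zero, otherwise Poisson(lam)."""
    y = np.asarray(y, dtype=float)
    ll = np.where(
        y == 0,
        # a zero is either structural or a Poisson draw of zero
        np.log(pi + (1 - pi) * np.exp(-lam)),
        # positive counts can only come from the Poisson component
        np.log(1 - pi) - lam + y * np.log(lam) - gammaln(y + 1),
    )
    return ll.sum()
```

With `pi = 0` this reduces exactly to a plain Poisson log-likelihood, which is a useful sanity check when implementing the mixture.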
Tania Lorido Botran. Anomaly detection in Virtual Machine performance with Dirichlet Process Gaussian Mixture Models
Abstract: Virtualized data centers rely on Virtual Machines (VMs) and/or containers to run applications. Many factors can cause anomalies in VM performance. In a cloud environment, several VMs can be co-located on the same server, which may lead to interference among them and, consequently, to disturbances in their performance as they compete for shared resources such as cache or memory.

We introduce a novel methodology to detect this type of anomaly in VM performance. Intuitively, we can create a model or profile of the resource usage at different times. We can then compare these models, quantify their dissimilarity and, finally, raise an alarm when an anomaly is identified.

We have selected Dirichlet Process Gaussian Mixture Models (DPGMMs) as a way of modeling VM resource consumption. The model represents a profile of the resource usage over a certain period of time. Concretely, a profile is defined as a clustering, i.e., a combination of components with given means, standard deviations and weights.

The proposed methodology relies on comparing profiles, or clusterings, from different periods. Clustering comparison has been used for different tasks, e.g., image retrieval. We have adapted the commonly used Kullback-Leibler divergence to compare GMM-based models. In general, we interpret an anomaly as a short-term change (high divergence) between two clusterings while the long-term resource usage remains constant (low divergence).

We have tested the algorithm on a custom testbed, using an application running on top of Docker containers. We have simulated anomalies by limiting the resource cap (share) to a given container for a certain period of time, and by varying the workload rate. The proposed algorithm is able not only to adapt itself to the dynamic nature of application performance (workload variations, or a change in the VM configuration), but also to capture relationships (correlations) between different resources (e.g. low CPU usage, high disk I/O rate).
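Since the KL divergence between two full Gaussian mixtures has no closed form, GMM comparisons are typically approximated from the closed-form divergence between individual Gaussian components (e.g., via matching or Monte Carlo). A sketch of that building block, illustrative rather than the paper's exact adaptation:

```python
import numpy as np

def kl_gaussian(mu0, cov0, mu1, cov1):
    """Closed-form KL( N(mu0, cov0) || N(mu1, cov1) )."""
    d = len(mu0)
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (
        np.trace(cov1_inv @ cov0)      # covariance mismatch
        + diff @ cov1_inv @ diff       # mean shift term
        - d
        + np.log(np.linalg.det(cov1) / np.linalg.det(cov0))
    )
```

The divergence is zero for identical profiles and grows with shifts in means or covariances, which is what lets a threshold on it flag short-term anomalies.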
Azam Hamidinekoo, Erika R.E. Denton and Reyer Zwiggelaar. Breast Cancer Prediction and Phenotyping based on Mammographic and Histologic Data
Abstract: Given the importance of breast cancer among women worldwide, this research introduces a promising new paradigm in computerised cancer research. We investigate the relationship between various imaging approaches, which relies on defining a relationship between different data sets across a range of spatial and temporal scales.
The main aim of the research project is to develop a mapping model of features/phenotypes between mammographic abnormalities (low dose, X-ray images) [1] and their histological representation (stained tissue sample images) [2]. To achieve this, we will automatically extract morphological/appearance features from mammographic and histological images and build a map between these based on a large dataset.
The initial phase of the project represents the development of automatic Computer Aided Diagnosis (CAD) systems for the mammography and histology data, which will use modern deep learning techniques (e.g. Convolutional Neural Networks (CNNs)) to determine a set of mammographic (MP) and histological (HP) phenotypes. The development of such models is based on minimising the difference between the NHS-BIRADS labels [3] and NHSBSP labels [4], provided by expert radiologists and histopathologists respectively, and those predicted by the model. The aim of the training stage is to estimate the CNN parameters of the mammographic and histologic models and the features MP and HP which are used to obtain the optimal classification results. The resulting models can be used to predict the NHS-BIRADS/NHSBSP labels directly from mammographic and histological image data respectively.
Once we have estimated the mammographic and histological models, we can use those models to generate patient matched mammographic and histological features/phenotypes, and their relationship can be estimated by developing a linking map (M-LM-H) based on deep learning techniques to estimate the relationships between the mammographic and histological phenotypes. The estimation is based on dataset {1..N}, which contains mammographic abnormal regions and histological samples representing those abnormal regions.
The final stage of the development is to use unseen mammographic cases to predict the histological classification.
The result of the research is to link the appearance of mammographic abnormalities to specific histological information and to predict how micro-biological changes are reflected in macro-images, or whether such a reflection can be found at all. This might be further extended to cover various ethnic populations and links to breast cancer pathways. The results can also help to avoid the need for further biopsy when an abnormality is deemed benign, and we anticipate that these association maps will contribute to clinical decision making and treatment management.

[1] Oliver, Arnau, et al. "A review of automatic mass detection and segmentation in mammographic images." Medical Image Analysis 14.2 (2010): 87-110.
[2] Gurcan, Metin N., et al. "Histopathological image analysis: A review." IEEE Reviews in Biomedical Engineering 2 (2009): 147-171.
[3] Wilson, Robin, et al. "Clinical guidelines for breast cancer screening assessment." NHSBSP Publication 49 (2001).
[4] Ellis, I. O., et al. "Pathology reporting of breast disease." NHS Publication 58 (2005).
Corina Florescu and Cornelia Caragea. An Unsupervised Algorithm for Keyphrase Extraction
Abstract: The large and growing amounts of online textual data present both challenges and opportunities to enhance knowledge discovery. One important challenge is to automatically extract a small set of keyphrases from a document that accurately describe the document's content and that allow for fast information processing. Approaches for automatic keyphrase extraction have been proposed in the literature along two lines of research: supervised and unsupervised.

Although supervised approaches typically work better than unsupervised ones, the requirement for large corpora with human-annotated keyphrases has led to significant attention towards the design of unsupervised approaches. Many features used to encode a candidate phrase in the supervised line of research have influenced the progress of unsupervised approaches. For example, the term frequency - inverse document frequency (tf-idf) of a word has been shown to be a very effective feature in supervised keyphrase extraction. Ranking terms by their tf-idf scores in unsupervised methods performs very well, often rivaling graph-based methods. However, tf-idf scoring computes the importance of a word based on its prevalence in the document (and its idf), independently of the other words in the document.

Graph-based ranking approaches compute the importance of a word based on how related it is to the other words in a document [1]. More precisely, given a document, a word graph is built, where nodes correspond to words and edges are drawn between words with ``meaningful'' relations. The score of a word is obtained by employing a graph-based ranking algorithm such as HITS or PageRank.

In supervised models, another remarkably informative feature is the relative position of the first occurrence of a phrase in a document (or its normalized distance from the beginning of a document). This feature is shown to be one of the top ranked features by Information Gain. However, its strength has not yet been exploited in unsupervised approaches.

To address this problem, we propose PositionRank, a novel graph-based unsupervised approach which incorporates both the position of words and their frequency in a document into a position-biased PageRank algorithm. For each word, we compute a compositional weight that combines the position of the word with its frequency in the document. The idea of PositionRank is to assign larger weights to words that appear early and frequently in the document, and then to compute a PageRank score biased towards the words with higher weights.

We experimentally validate PositionRank and show statistically significant improvements over five state-of-the-art models for keyphrase extraction, in terms of Precision, Recall, F1-score and MRR. Our experiments on three datasets of research papers show that our model achieves better performance than strong baselines: TextRank, SingleRank, ExpandRank, and TopicalPageRank [1-3] with relative improvements in performance as high as 29.6%.
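The position-biased PageRank described above can be sketched with a toy co-occurrence graph, where the teleport distribution is set from position-based word weights. A simplified numpy illustration, not the authors' implementation (graph and weights are invented):

```python
import numpy as np

def position_rank(adj, pos_weight, d=0.85, iters=100):
    """PageRank with the teleport distribution biased by position weights.

    adj: symmetric word co-occurrence matrix (0/1);
    pos_weight[i]: e.g. the sum of 1/position over occurrences of word i.
    """
    p = pos_weight / pos_weight.sum()              # biased teleport vector
    deg = adj.sum(axis=1)
    M = adj / np.where(deg > 0, deg, 1)[:, None]   # row-stochastic transitions
    r = np.full(len(p), 1.0 / len(p))
    for _ in range(iters):
        r = (1 - d) * p + d * (M.T @ r)            # power iteration
    return r

# toy graph: word 0 occurs at positions 1 and 3, word 1 at 2, word 2 at 4
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
weights = np.array([1 + 1 / 3, 1 / 2, 1 / 4])
scores = position_rank(adj, weights)
```

With identical connectivity for all three words, the ranking is driven entirely by the position bias: the word appearing early and often receives the highest score.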


[1] Rada Mihalcea and Paul Tarau. TextRank: Bringing Order into Text. In EMNLP’04.

[2] Xiaojun Wan and Jianguo Xiao. Single Document Keyphrase Extraction Using Neighborhood Knowledge. In AAAI’08.

[3] Liu, Zhiyuan and Huang, Wenyi and Zheng, Yabin and Sun, Maosong. Automatic keyphrase extraction via topic decomposition. In ACL’10.
Ruth C. Fong, Walter J. Scheirer and David D. Cox. Using human brain activity to guide machine learning
Abstract: While the human brain has long served as a source of inspiration for machine learning, little effort has been made to directly use data collected from working brains as a guide for machine learning algorithms. Here we demonstrate a new paradigm of “neurally-weighted” machine learning, which takes fMRI measurements of human brain activity from subjects viewing images, and infuses these data into the training process of an object recognition learning algorithm to make it more consistent with the human brain. After training, these neurally-weighted classifiers are able to classify images without requiring any additional neural data.

Specifically, we present a modified formulation of the loss function for support vector machine (SVM) classifiers that proportionally penalizes training examples based on a "perceptual annotation score". This score is the Platt probability output from another SVM trained on voxel activity alone and captures the ease with which the human brain identifies objects in images. Thus, the neurally-weighted SVM yields a decision boundary that better matches that of humans by prioritizing "strong" examples as identified by the human brain.

We show that our neural-weighting approach can lead to large performance gains when used with traditional machine vision features, as well as to significant improvements with already high-performing convolutional neural network features trained on ImageNet. The effectiveness of this approach points to a path forward for a new class of hybrid, "biologically-informed" machine learning algorithms which take both inspiration and direct constraints from neuronal data.
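The per-example penalty described above maps naturally onto the `sample_weight` mechanism of standard SVM implementations, which scales each example's hinge-loss contribution. A sketch with synthetic data and hypothetical perceptual annotation scores (a close analogue of, not necessarily identical to, the paper's modified formulation):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# hypothetical perceptual annotation scores in (0, 1]: the ease with
# which human brain activity separates each training example
score = rng.uniform(0.2, 1.0, size=200)

# per-example weights scale the hinge-loss penalty, so "strong"
# examples (high score) shape the decision boundary more
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y, sample_weight=score)
acc = clf.score(X, y)
```

At test time the fitted classifier needs no neural data, mirroring the paper's point that neurally-weighted classifiers classify images without additional fMRI measurements.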
Ashwini Tonge and Cornelia Caragea. DeepPrivate: Accurate Prediction of Images’ Privacy on the Web
Abstract: Online image sharing on social media sites such as Facebook, Flickr, and Instagram can lead to unwanted disclosure and privacy violations when privacy settings are used inappropriately. With the exponential increase in the number of images shared online every day, the development of effective and efficient prediction methods for image privacy settings is highly needed. A rule-based classifier that classifies images as private if they contain people does not work well in a real-world scenario. For example, a public image of a tennis player in a tournament will be falsely identified as private by such a classifier. Several studies have started to explore image privacy classification using tags and content features such as SIFT (Scale Invariant Feature Transform) or RGB [1,2]. Lately, Convolutional Neural Networks (CNNs) have been shown to work remarkably well in many visual content analysis tasks, such as generating short descriptions of images and videos.

We propose an approach to image privacy prediction, called DeepPrivate, which first performs image content analysis using CNNs to extract deep visual features and deep tags, and then uses Support Vector Machines (SVMs) on these feature representations to predict the class of an image as public or private. This task will aid social networking sites in flagging images with sensitive content and has great potential to protect online users’ privacy. Since not all images on social networking sites have user tags, we used an automatic technique to annotate images with tags based on their visual content; we call these deep tags. For deep features, we explored various layers of two network architectures, AlexNet and GoogLeNet, which are winners of the ImageNet visual object recognition challenge.

We show empirically on a real-world Flickr dataset that DeepPrivate, which combines the strengths of CNNs and SVMs, outperforms: (1) existing state-of-the-art models that use SIFT and RGB [1,2]; and (2) a rule-based learner that predicts an image as private if it contains people. We also examined user tags and deep tags independently and in combination. The tag analysis provides a brief outline of the tags associated with public and private images. Learning curves and precision-recall curves show that DeepPrivate yields a better-performing model even with a limited amount of training data and obtains high precision at high recall compared to the baselines. The results indicate remarkable performance on image privacy prediction when using DeepPrivate with various deep features. Compared with previous approaches to this task and with rule-based models, DeepPrivate yields a 10% improvement in performance over all these models. Moreover, we achieve good performance using visual features alone, eliminating the need for user tags that require human annotation. We thus demonstrate that simple rule-based models, or models with content features such as SIFT or RGB, do not suffice for this task, and that advanced image content analysis based on deep learning is required.
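The second stage of the pipeline, an SVM on precomputed deep features, can be sketched as follows. The arrays below are random stand-ins for CNN activations (e.g., a fully-connected AlexNet or GoogLeNet layer), not real data, and the separation between classes is artificial:

```python
import numpy as np
from sklearn.svm import LinearSVC

# random stand-ins for deep visual features of private vs. public images
rng = np.random.default_rng(0)
private_feats = rng.normal(0.5, 1.0, size=(100, 64))
public_feats = rng.normal(-0.5, 1.0, size=(100, 64))
X = np.vstack([private_feats, public_feats])
y = np.array([1] * 100 + [0] * 100)  # 1 = private, 0 = public

# linear SVM on the feature representation, as in the CNN+SVM pipeline
clf = LinearSVC(C=1.0).fit(X, y)
train_acc = clf.score(X, y)
```

In the real system the feature extractor is the expensive part; the linear SVM on top is cheap to retrain as new labeled images arrive.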

[1] Sergej Zerr, Stefan Siersdorfer, Jonathon Hare, and Elena Demidova. Privacy-aware image classification and search. SIGIR’12.
[2] Anna C. Squicciarini, Cornelia Caragea, and Rahul Balakavi. Analyzing images’ privacy for the modern web. HT’14.
Maria Panteli. Automatic tagging in world music collections
Abstract: We are interested in studying music similarity within a corpus of world music recordings. The task of modeling music similarity is subjective. When no annotations are available, music metadata can be used to train models of music similarity. For example, algorithms can be trained to predict music as similar whenever music tracks share tags of genre, mood, or artist. Music similarity algorithms in past research have focused mainly on Western music and the tags associated with it, and little has been done on non-Western music. We aim to investigate content-based approaches for automatic tagging in world music for the purpose of studying music similarity in a large corpus later on.

The challenges associated with the analysis of world music result from, amongst others, degraded audio quality and, with respect to the metadata, the inconsistent taxonomy of world music genres. Large world music archives usually contain material recorded under a variety of recording conditions, e.g., field recordings, and are preserved to different degrees of fidelity. This imposes challenges on the content-based analysis of the audio signal. On the other hand, metadata associated with such collections are usually limited to spatio-temporal information, such as the country and year of the recording, and the cultural background of the performers, for example their language, ethnic group or culture. Therefore, the tags describing world music are different from the tags we encounter in Western music, and pre-trained models for automatic tagging are not suitable for world music tagging. What is more, world music archives combine the work of many individual collectors, and the archive as a whole often suffers from incomplete or inconsistent metadata.

We investigate content-based tagging techniques to overcome the problem of unlabeled data in world music collections. We represent sound recordings by Mel spectrograms and use a multi-layer Convolutional Neural Network (CNN) to learn high-level features. Mel spectrograms and CNNs have been successfully applied in content-based music models. For the purpose of automatic tagging, the output of the network is treated as a multi-label classifier. Labels in this case consist of world music tags such as the country, language and culture of a recording. The output of classification indicates which labels (between none and all) are best suited to each recording. We further investigate classification instances with rare outcomes (for example, the case where none of the labels match) to decide whether a recording denotes a new tag, e.g., a new country, language or culture not included in our training set. The discovery of new tags in this manner can have a significant impact on the musicological analysis of world music collections. Automatic tagging facilitates further computational analysis and enables the exploratory study of music similarity in a large corpus of world music.
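The Mel spectrogram input representation rests on a triangular Mel filterbank that maps FFT bins onto perceptually spaced frequency bands. A self-contained numpy sketch of its construction (in practice a library such as librosa would be used; the parameter values here are arbitrary defaults):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_fft=1024, sr=22050):
    """Triangular filters mapping an FFT power spectrum to Mel bands."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, ctr):                      # rising slope
            fb[i, b] = (b - lo) / max(ctr - lo, 1)
        for b in range(ctr, hi):                      # falling slope
            fb[i, b] = (hi - b) / max(hi - ctr, 1)
    return fb

fb = mel_filterbank()
```

Multiplying a short-time power spectrum by `fb` (and taking logs) yields the Mel spectrogram frames that serve as CNN input.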
Laura Florescu and Will Perkins. Spectral thresholds in the bipartite stochastic block model
Abstract: We consider a bipartite stochastic block model on vertex sets $V_1$ and $V_2$ of sizes $n_1$ and $n_2$ respectively, with planted partitions in each, and ask at what densities spectral algorithms can recover the partition of the smaller vertex set. The model was used by Feldman et al. (2014) to give a unified algorithm for random planted hypergraph partitioning and planted random $k$-SAT.

When $n_2 \gg n_1$, multiple thresholds emerge. We show that the singular vectors of the rectangular adjacency matrix exhibit a localization / delocalization phase transition at edge density $p = \tilde \Theta(n_1^{-2/3} n_2^{-1/3})$, giving recovery above the threshold and no recovery below. Nevertheless, we propose a simple spectral algorithm, Diagonal Deletion SVD, which recovers the partition at density $p = \tilde \Theta(n_1^{-1/2} n_2^{-1/2})$.

Finally, we locate a sharp threshold for detection of the partition, in the sense of the results of Mossel, Neeman, Sly and Massouli\'e for the stochastic block model. This gives the best known bounds for efficient recovery densities in planted $k$-SAT and hypergraph partitioning as well as showing a barrier to further improvement via the reduction to the bipartite block model.
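The Diagonal Deletion idea can be illustrated on a toy planted bipartite model: form the Gram matrix $A A^T$ of the biadjacency matrix, delete its diagonal, and read the partition off a leading eigenvector. The sketch below additionally centers the matrix and thresholds eigenvector signs; it is a simplified illustration of the spirit of the algorithm, not the paper's exact procedure or parameter regime:

```python
import numpy as np

def diagonal_deletion_partition(A):
    """Partition the rows of a bipartite biadjacency matrix A: form the
    Gram matrix A A^T, delete its diagonal, center, and threshold the
    sign of the leading eigenvector (eigh sorts eigenvalues ascending)."""
    B = (A @ A.T).astype(float)
    np.fill_diagonal(B, 0.0)     # the "diagonal deletion" step
    B -= B.mean()                # remove the uninformative all-ones component
    _, vecs = np.linalg.eigh(B)
    return vecs[:, -1] > 0

# toy planted model: rows and columns each split in half; same-side
# pairs connect with prob 0.3, cross pairs with prob 0.05
rng = np.random.default_rng(0)
n1, n2 = 40, 1000
row_comm = np.arange(n1) < n1 // 2
col_comm = np.arange(n2) < n2 // 2
P = np.where(row_comm[:, None] == col_comm[None, :], 0.3, 0.05)
A = (rng.random((n1, n2)) < P).astype(float)
labels = diagonal_deletion_partition(A)
```

Deleting the diagonal removes the heavy row-degree terms of $A A^T$ that would otherwise dominate the spectrum in the $n_2 \gg n_1$ regime, which is what lets recovery succeed at lower densities.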
Aline Villavicencio, Silvio Cordeiro, Carlos Ramisch and Marco Idiart. Identifying Idiomatic Language with Distributional Semantic Models
Abstract: Precise natural language understanding requires adequate treatment both of single words and of larger units. However, expressions like compound nouns may display idiomaticity: while a police car is a car used by the police, a loan shark is not a fish that can be borrowed. Therefore it is important to identify which expressions are idiomatic and which are not, as the latter can be interpreted from a combination of the meanings of their component words while the former cannot. This abstract discusses the ability of distributional semantic models (DSMs) to capture idiomaticity in compounds, by means of a large-scale multilingual evaluation of DSMs in French and English.

DSMs use context information obtained from sentences in a large corpus to represent the meaning of lexical units as vectors in a semantic space, and proximity in this space is assumed to indicate semantic relatedness. However, DSMs are often evaluated on artificial similarity datasets containing single words or compositional phrases. To address this point, evaluation is done on a balanced multilingual dataset of idiomatic, partly compositional and fully compositional compounds. Moreover, the ability of DSMs to accurately capture semantic information may be affected by a number of factors involved in constructing the models, such as the size of the input corpus [1], the size of the context used to build the model [2], and the number of dimensions used to represent the context [3]. In this abstract, we discuss the impact of three variables: the level of corpus preprocessing for the two languages, the size of the context window, and the number of dimensions for context representation. Additionally, this analysis includes standard DSMs based on positive Pointwise Mutual Information (PPMI) as well as widely used word embeddings such as word2vec [4] and GloVe [5].

A total of 816 DSMs were constructed and assessed in 2,856 evaluations. The results obtained show a high correlation with human judgments of compound idiomaticity (Spearman’s ρ=.82 on one dataset), indicating that these models are able to successfully detect idiomaticity. For the number of dimensions, across languages, larger values often bring better performance. Likewise, more preprocessing (lemmatization and stopword removal) leads to better results. The optimal window size depends on the model and language, but for the best models, a smaller window outperforms larger windows. This may be a consequence of the linear-decay context weighting process, which assigns higher weights to closer words as the context window size increases.

As an overall conclusion, in combination with a large dimension and a small window size, investing in preprocessing provides a good balance of a smaller vocabulary and good accuracy. This is especially clear for a morphologically richer language like French, where lemmatization is homogeneously better for all models, even for those in which surface forms performed better in English. We hypothesise that this is because preprocessing decreases data sparseness. Regarding the models, PPMI based DSMs and word2vec alternated in providing the best results, indicating that traditional models perform comparably to newer word embeddings, contrary to recent findings [6].
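A common way to score compound idiomaticity with a DSM is to compare the compound's observed vector against a composition of its parts; low similarity suggests idiomaticity. A minimal sketch using additive composition (one of several composition functions, not necessarily the one used in this work):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def compositionality(compound_vec, head_vec, modifier_vec):
    """Cosine between a compound's observed DSM vector and the additive
    composition of its parts; low scores suggest idiomaticity."""
    return cosine(compound_vec, head_vec + modifier_vec)
```

For a compositional compound like "police car" the observed vector should lie close to the composed one; for "loan shark" it should not.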
Victoria Ashley Villar. Estimating Supernova Metallicities Using Neural Networks
Abstract: Type Ia supernovae (SNe) are the explosive deaths of white dwarf stars, which can be seen giga-lightyears away. Although the exact mechanics of supernova ignition are poorly understood, these luminous events have been extremely useful for understanding our universe. In particular, Type Ia SNe can be used as so-called “standard candles” to measure distances. These distances were recently used to demonstrate that our universe is expanding at an accelerated rate (Riess et al. 1998). Type Ia SNe can be used as standard candles because they are a fairly homogeneous class of objects; however, the small variability in these events tells us important properties about stellar lives and evolution. One source of such variance is metallicity, the abundance of heavy elements in the white dwarf star before its explosion. The effect of metallicity on observed supernova properties is currently unclear because we cannot directly measure the metallicity of most supernova progenitor stars. These effects might systematically bias our understanding of cosmology, because the early universe contained fewer heavy elements than today's universe. An estimator of metallicity can help both to reduce the uncertainty on the history of our universe and to guide stellar models.

Inspired by the success of neural networks in the field of image processing, we aim to estimate the metallicities of supernova progenitors from their post-explosion spectra using feed-forward neural networks. This project uses two key pieces of data: down-sampled supernova spectra (the 1D input data array) and the progenitor metallicity (the scalar target variable). The application of neural networks to physical spectra is a largely unexplored scientific problem. In our particular case, SNe spectra are similar to images in the sense that important features are often spatially centralized. However, unlike images, these features often have a characteristic width (determined by physical properties of the supernova). Only the strength or position of these features likely matters when performing regression.

Our preliminary results are promising. Compared to results from complex physical models, the root-mean-square errors (RMSE) of the progenitor metallicities predicted by our multilayer perceptron are substantially smaller. The scatter of these RMSE values approaches the statistical uncertainty expected within our sample of SNe. From this analysis, we also find that the blue end of the spectrum correlates heavily with metallicity, a result that is consistent with the “line-blanketing” properties of iron and other heavy metals. The results of our experiments show that MLPs have potential for estimating SNe metallicities from their spectra.
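A toy version of the regression setup: a multilayer perceptron trained to recover a scalar from synthetic "spectra" whose blue-end absorption depth encodes it. The data generator below is invented purely for illustration and bears no relation to real SN spectra or the paper's physical models:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, 500)                 # invented "metallicity"
wavelengths = np.linspace(0.0, 1.0, 100)
# absorption dip at the blue end whose depth encodes z, plus noise
spectra = 1.0 - z[:, None] * np.exp(-((wavelengths - 0.1) / 0.05) ** 2)
spectra += 0.01 * rng.normal(size=spectra.shape)

mlp = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
mlp.fit(spectra[:400], z[:400])                # train on 400 spectra
rmse = np.sqrt(np.mean((mlp.predict(spectra[400:]) - z[400:]) ** 2))
```

Held-out RMSE against the known generator value is the same figure of merit the abstract reports against physical-model metallicities.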
Julia Spinelli. Discovering Groups of Brains with Similar Spatial Patterns of Correlated Regions
Abstract: The brain is the body’s information processing unit, controlling thoughts and actions. How the brain completes all of these tasks is still a very active area of research. Numerous hypotheses about the roles of brain regions exist, assigning cognitive tasks to regions. Recent work has shown that most complex tasks require coordinated engagement among many regions, collectively called a network. Functional networks are defined as groups of regions correlated above a threshold level. This study focuses on learning whether every brain shares the same networks or whether there are subgroups within a population with different network patterns.

The natural connectivities that exist in the brain can be determined by task-free functional magnetic resonance imaging (fMRI). The subject is told to lie in the scanner and stay awake, letting their mind wander without trying to complete any specific task. While a brain is in a task-free state, fMRI technology measures the blood-oxygen level dependent (BOLD) signal. The BOLD signal is based on changes in blood oxygenation and blood flow that are used as a surrogate for increases in neural activity. This study compares the time courses of BOLD signals of 264 brain regions for 111 human subjects of different ages and genders.

In this study, for each subject, the time course for each brain region is compared to every other brain region to construct a correlation matrix. These matrices are then clustered using the Dirichlet Process Gaussian Mixture Model to determine whether groups of subjects with similar patterns of functional connectivity exist. In order to visualize these high dimensional clusters for inspection by domain experts, networks are constructed by feature selection on the cluster means using the Extra Trees Classifier algorithm. The most useful features for distinguishing each cluster from all other clusters are determined. Features are then visualized as a network representative of that cluster. The lowest-ranking features for each cluster are also useful for determining a shared network among the clusters.
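A minimal sketch of this clustering-plus-feature-selection pipeline, using scikit-learn's Dirichlet-process variant of the Gaussian mixture and an Extra Trees classifier. The data here is a synthetic stand-in (vectorized correlation values per subject), and all sizes are illustrative:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)

# Stand-in: 111 subjects, each summarized by the upper triangle of a
# small region-by-region correlation matrix (45 edge values).
n_subjects, n_features = 111, 45
X = np.vstack([rng.normal(0.2, 0.1, (55, n_features)),
               rng.normal(-0.2, 0.1, (56, n_features))])

# Dirichlet-process mixture: unused components are pruned automatically.
dpgmm = BayesianGaussianMixture(
    n_components=10, weight_concentration_prior_type="dirichlet_process",
    max_iter=500, random_state=0)
labels = dpgmm.fit_predict(X)

# Rank the edges that best distinguish the discovered clusters.
trees = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, labels)
top_edges = np.argsort(trees.feature_importances_)[::-1][:5]
```

The top-ranked edges would then be plotted as a representative network for each cluster.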

The results show that groups of subjects with similar patterns of functional connectivity do exist. The regions from the most distinguishing feature for each cluster are then plotted on an image of the brain to anatomically visualize the results.
Andrea Yanez. Decision Support System for Pandemic Management
Abstract: Countless infectious diseases continue to affect the world, spread by insects, human contact, and other means. Epidemics and outbreaks are not a new threat to human security, but globalization has changed the rules, allowing faster and broader spread and an increased risk of becoming a pandemic. To face these challenges, the European Union (EU) has been continuously dedicated to establishing strategies and identifying and providing solutions for pandemic risk and emergency management. In order to improve and update the current mechanisms, the Pandemic Risk and Emergency Management (PANDEM) project was created [1].

In the context of the project, several Information and Communications Technology challenges have been identified. We focus on providing a decision support tool to speed up reactions to a pandemic situation. Specifically, we wish to address two main challenges: (1) integration of diverse data sources and (2) data summarization and interactive visualization of the information to support users in making different types of decisions. We are at an initial stage of assessing current practice to define requirements.

On the one hand, for challenge (1), we need to integrate data from both traditional surveillance and non-traditional sources, such as social media. We would like to analyse the use of microblogs, given that these have proven to be a valuable source of real-time data even though information may not be official or confirmed. Microblog data is very noisy, and for this reason we plan to use classifiers to identify only posts with health-related content. The goal is to create techniques that can extract real-time relevant information during epidemic emergencies.

On the other hand, for challenge (2), both data summarization and data visualization are needed.

In order to provide a summarized overview, we will use clustering techniques to group similar data and detect events. Each group of data surrounding an outbreak needs to be labelled with high-level topics that highlight the most relevant information depicted by the cluster. Moreover, clusters also need to be ranked so they can be processed in order of importance.

Regarding visualisation, interactive views give a better understanding of how an epidemic spreads. Since microblogs are time-stamped, we can analyse the spatio-temporal spread of diseases and population movements by plotting data on interactive maps. A decision support tool must present data through a collection of interactive views that organize, filter, find anomalies, and provide summarized overviews allowing users to navigate the data along different dimensions (e.g. time, space).

In conclusion, appropriate presentation of available information to decision-makers is critical in the management of a pandemic situation. Our overall goal is to offer decision makers a tool that provides an overview of what is happening during an outbreak, offering the right information at the right time. In the context of the PANDEM project, this tool can be vital for decreasing the response time to an epidemic and preventing its intensification and its negative impact on society.

[1] http://www.pandem.eu.com/
Maja Svanberg. A multi-class approach for categorizing blocks-programming projects
Abstract: App Inventor is an online environment for building Android apps. It uses block programming in order to democratize programming and empower users to be creators of apps instead of consumers. As part of this mission, there are several resources where users learn the basics. These include the App Inventor website, the book App Inventor by David Wolber (2011), and courses from independent providers. Tutorials from these sources guide users in getting started and accelerate their learning. Our research goal is to study the effect of tutorials on learning. To accomplish that, we need to classify projects as either tutorials or “creative projects”. This precedes any analysis for measuring learning of computational skills such as abstraction or modularity. Our current focus is to investigate how machine learning algorithms can help us understand learning patterns, much like Piech et al [1].

Our first obstacle was to find a project representation that captures only the computational concepts and discards other information in the raw XML project files. We chose to represent each building block in a project (e.g. a for-loop or a global variable) as a separate feature, and the entire project as a point in a hyperspace. This is similar to the bag-of-words approach used in natural language processing. We do not consider the order of blocks, only their presence. We populate the hyperspace with the points representing the known tutorials and then use the 1-Nearest Neighbor classifier to find, for each project, the tutorial closest to it. We found provisional test data by matching the names of the projects to names of tutorials. Applying the 1-NN algorithm to this data, we reached 70% accuracy.
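The bag-of-blocks representation and 1-NN matching can be sketched as follows; the block names and project contents here are hypothetical, not actual App Inventor vocabulary:

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical block lists extracted from project XML; each project
# becomes a bag-of-blocks count vector, analogous to bag-of-words.
tutorials = {
    "TalkToMe": ["button_click", "text_to_speech", "global_var"],
    "BallBounce": ["canvas_flung", "set_speed", "set_heading"],
}
projects = {
    "my_first_app": ["button_click", "text_to_speech"],
    "bouncy": ["canvas_flung", "set_speed", "set_heading", "global_var"],
}

vec = DictVectorizer()
X_tut = vec.fit_transform(Counter(b) for b in tutorials.values())
X_proj = vec.transform(Counter(b) for b in projects.values())

# 1-NN: label each user project with its closest tutorial.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_tut, list(tutorials))
matches = dict(zip(projects, knn.predict(X_proj)))
```

A distance threshold on each match would then separate near-copies of tutorials from genuinely creative projects.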

However, the category “not tutorial” is missing, and it would have to include a wide range of projects. Therefore, we need to define thresholds to determine how far in the hyperspace a project can be from the original tutorial and still be classified with its label. While this is rather simple to do on a project-by-project basis, we strive to find a metric that can be used to normalize these thresholds. The high dimensionality of the hyperspace (>2000 dimensions) is a contributing factor to why this is a problem. Our enhanced approach breaks the space into several subspaces, composed of different feature combinations. We are currently building classifiers for these subspaces in order to create an ensemble with majority voting.

We hope to continue developing approaches to reduce dimensionality. We would also like to find a method of testing our results that does not require manually labelling large amounts of test data or relying on the names of the user projects. As the list of tutorials can be expanded and modified, an automated system to ensure reliability in our identification would make our algorithm easily adaptable to developments in App Inventor and its community.

[1] Piech, C., Sahami, M., Huang, J., & Guibas, L. (2015). Autonomously generating hints by inferring problem solving policies. In Proceedings of the Second ACM Conference on Learning @ Scale, pp. 195–204.
Ellyn Ayton and Svitlana Volkova. Predicting Influenza Dynamics with Neural Networks Using Signals from Social Media
Abstract: Motivation: Every year there are 500,000 deaths worldwide attributed to influenza [7]. The Centers for Disease Control and Prevention (CDC) reports weekly on the level of influenza-like illness (ILI) seen year round in hospitals and doctor visits. These values are used to monitor the spread and impact of influenza; however, by the time the ILI data is released, the information is already 1-2 weeks old and is frequently inaccurate until revisions are made [8]. To overcome this, we propose using large amounts of social media data, such as Twitter, as a secondary source of information to predict current and future ILI proportions — the total number of people seeking medical attention with ILI symptoms. In previous related work, flu forecasting has been accomplished through the use of basic linear autoregressive models, linear autoregression exogenous models, SVM regressions, logistic regression classifiers, SIR models, and more [1, 2, 5, 6]. The addition of social media features to several of these models, such as the linear autoregressive model, has improved performance over ILI data alone [3, 4, 8]. Our work is geared toward applying these data sources to more powerful machine learning models. This predictive power can help health officials properly prepare for and respond to yearly flu outbreaks.

Approach: By integrating the information that people tweet about (e.g., topics, syntax, style) and their communication behavior (e.g., hashtags, mentions), we built predictive models for ILI and confirmed influenza activity across different geographical locations in the U.S. We experiment with and evaluate the predictive power of a variety of features and machine learning models (e.g., Support Vector Machines with radial basis function or linear kernels, AdaBoost with Decision Trees [10]). We are the first to evaluate the predictive power of neural networks, specifically Long Short-Term Memory (LSTM) networks, for ILI nowcasting and forecasting [9]. An LSTM is a special type of recurrent neural network (RNN) that is capable of preserving information and learning long-term dependencies in data, which traditional RNNs struggle with. For this reason, we chose LSTMs to model our data over the course of several weeks.
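As a hedged illustration of one of the non-neural baselines mentioned above, AdaBoost with Decision Trees can be applied to lagged weekly values. The ILI series below is synthetic, and the real models also take Twitter-derived features as input:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic weekly ILI proportions with a seasonal cycle (stand-in data).
weeks = np.arange(200)
ili = 2 + np.sin(2 * np.pi * weeks / 52) + 0.1 * rng.normal(size=200)

# Use the past four weeks to predict the current week (nowcasting).
lags = 4
X = np.column_stack([ili[i:len(ili) - lags + i] for i in range(lags)])
y = ili[lags:]

model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3),
                          n_estimators=100, random_state=0)
model.fit(X[:150], y[:150])
rmse = np.sqrt(np.mean((model.predict(X[150:]) - y[150:]) ** 2))
```

Forecasting further ahead simply shifts the target further from the lag window; the LSTM variant replaces the regressor with a recurrent model over the same windows.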

Results: We found that LSTMs achieve the best performance regardless of which text representations are included (e.g., embeddings vs. raw tweets). Of our nine features extracted from Twitter, AdaBoost models learned from unigrams, hashtags, and word embeddings consistently outperform all other features. Using up to four weeks of past data, our models can accurately predict ILI proportions for the current week and forecast ILI values up to two weeks ahead. We also found that a model tailored to a specific location outperforms a general model encompassing all regions. In future work, we will apply our LSTM model to 25 additional locations and combine our ILI and social media data into one predictive LSTM model.
Hannah Rashkin, Yejin Choi and Svitlana Volkova. Forecasting Event-Driven Connotation Dynamics and Classifying Event Types on Twitter with LSTMs
Abstract: The types of events reported in real-time on social media (e.g. Twitter) and the connotation (e.g., perspectives) with which they are discussed are interconnected, dynamic phenomena. When a major world event occurs, topics of discussion change as do the attitudes towards the people and organizations involved. Moreover, these changes vary depending on a Twitter user’s background, such as where they live or what language they speak. To explore this phenomenon, we have created and analyzed a dataset of 6.3 million tweets produced by news accounts in 50 countries and 11 different languages taken from a two-week period around the Brussels attacks on March 22, 2016. We parsed these tweets and extracted part-of-speech tags using SyntaxNet (https://github.com/tensorflow/models/tree/master/syntaxnet).

This work focuses on three main contributions: (1) forecasting entity- and event-driven connotation dynamics, (2) event type classification using neural network models, and (3) analyzing variations and biases in Twitter news accounts' perspectives towards salient entities (e.g., people, organizations) across three dimensions: time, language, and location.
First, we trained Long Short-Term Memory (LSTM) models to predict the distribution of perspectives towards an entity on a given day based on the previous days. Other work has investigated the dynamics of overt sentiment and events using time-series analysis [7]. We add to this body of work by using deep learning to analyze connoted perspective, a subtler aspect of sentiment. We run experiments for languages other than English while tracking perspective per country, making this one of the first works to detect changes in sentiment over time with separation of location and language. Analysis of the results shows that an LSTM performs well at this task (KL divergence of 1.8 and Pearson correlation of 0.8 in English), as well as at the harder task of predicting perspective much later in time.

Next, we built upon previous work in event detection by creating a Twitter-specific heuristic for event clustering [1, 6, 3]. Our event clusters are used to automatically predict the event types indicated by a given tweet (e.g., attack, natural disaster), inspired by papers that have previously codified event types [4, 2]. We compared the performance of popular models such as logistic regression against RNN models with distributed word representations [5] on this task. These results were extended to the task of modeling event types sequentially to predict what types of events will be reported tomorrow.

Finally, we performed an exploratory analysis of Twitter news accounts' perspectives towards salient entities (e.g. people, organizations). We computed these perspectives using the connotation frame lexicon (http://homes.cs.washington.edu/~hrashkin/connframe.html), which captures nuanced connotative relationships (e.g., the writer’s perspective towards the agent or theme). One contribution of our work is that we have extended this resource to 10 non-English languages by propagating annotations via parallel corpora. Our analysis is one of the first to compare perspective across three dimensions: time, language, and location. We demonstrate empirically that the general public’s perspective changes over time and also varies across linguistic and geographic borders.

List of references in attached pdf.
Reyhane Askari Hemmat and Abdelhakim Hafid. An Unbalanced Classification Problem: SLA Violation Prediction
Abstract: Cloud computing reduces the maintenance costs of services and allows users to access on-demand services without being involved in technical implementation details. The relationship between a cloud provider and a customer is governed by a Service Level Agreement (SLA), established to define the level of the service and its associated costs. An SLA usually contains specific parameters and a minimum level of quality for each element of the service, negotiated between a cloud provider and a customer. The failure to provide the service is called an SLA violation.

From a provider’s point of view, since penalties have to be paid in case of an SLA violation, violation prediction is an essential task. By predicting violations, the provider can reallocate requests and prevent future violations. From a customer’s point of view, predicting future violations speaks to the provider’s trustworthiness; the customer also wants to receive the service on demand and without interruptions. In terms of availability, Amazon Elastic Cloud faced an outage in 2011 when it crashed and many large customers, such as Reddit and Quora, were down for more than a day. Thus, being able to predict SLA violations benefits both customers and providers.

To tackle this problem, one can use machine learning models to predict violations. The violation prediction task can be seen as a classification problem: using a classifier, we can predict whether a coming request will be violated or not. In this work, we explore two machine learning models, Naive Bayes and Random Forest classifiers, to predict SLA violations. Unlike previous work on SLA violation prediction or avoidance, our models are trained on a real-world dataset, which introduces new challenges that have been neglected in previous work. We test our models using the Google Cloud Cluster trace as the dataset.

Since SLA violations are rare events in the real world (∼0.2%), the classification task becomes more challenging because the classifier will always tend to predict the dominant class. To overcome this issue, we use several re-sampling methods, such as Random Over- and Under-Sampling, SMOTE, NearMiss, One-Sided Selection, Neighborhood Cleaning Rule, and ensembles of them, to re-balance the dataset.

We demonstrate that Random Forest with the SMOTE-ENN re-sampling technique achieves the best performance among these methods, with an accuracy of 0.9988 and an F1 score of 0.9980. Ensemble methods such as SMOTE-ENN overcome the problem of overfitting by re-sampling over classes. Random Forest, with its tree-based structure, is less sensitive to class distributions; thus, even with no re-sampling technique, it has acceptable performance (accuracy = 0.97 and F1 = 0.79). On the other hand, Naive Bayes classifiers are highly biased by the class distribution and do not produce acceptable results without re-sampling techniques. It is worth mentioning that the Random Forest model also yields human-interpretable results, which suggest the most important features causing the violations.
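A minimal sketch of the simplest of the re-balancing strategies named above, random over-sampling, paired with a Random Forest. The features and the violation rule below are synthetic stand-ins, not the Google trace:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic request features; violations (class 1) are made rare,
# mirroring the class imbalance described above.
X = rng.normal(size=(5000, 8))
y = (X[:, 0] + X[:, 1] > 3.5).astype(int)

X_train, X_test = X[:4000], X[4000:]
y_train, y_test = y[:4000], y[4000:]

# Random over-sampling: replicate minority rows until classes balance.
minority = np.flatnonzero(y_train == 1)
extra = rng.choice(minority, size=(y_train == 0).sum() - minority.size)
X_bal = np.vstack([X_train, X_train[extra]])
y_bal = np.concatenate([y_train, y_train[extra]])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_bal, y_bal)
f1 = f1_score(y_test, clf.predict(X_test))
```

SMOTE and SMOTE-ENN replace this replication step with synthetic minority samples (and noise cleaning); libraries such as imbalanced-learn provide those variants directly.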
Minyoung Kim, Stefano Alletto and Luca Rigazio. Similarity Mapping with Enhanced Siamese Network for Multi-Object Tracking
Abstract: Multi-object tracking has recently become an important area of machine vision, and several approaches have been proposed to improve tracking performance in speed and accuracy, with applications ranging from surveillance systems to autonomous driving and even sports analytics. Despite growing attention, achieving high-performance tracking is still challenging; state-of-the-art systems exhibit high complexity with a large number of hyperparameters, which may harm robustness, especially in real applications in unconstrained environments. Also, as deep neural networks (DNNs) have become popular for their capability to learn rich features, new DNN-based approaches for multi-object tracking have been proposed, taking advantage of Recurrent Neural Networks (RNNs) to incorporate temporal information. Although some of these new methods outperform conventional ones, their computational requirements are high, resulting in very low frame rates and high latency. Nevertheless, temporal information such as motion flow is crucial in object tracking and therefore cannot be discarded from a model without loss of performance.

To address these limitations, we present a new high speed tracking system, combining both appearance and temporal geometric information, using few hyperparameters, mostly independent of the dataset. We achieve this by leveraging our newly designed Enhanced Siamese Neural Network (ESNN) architecture for similarity mapping: the ESNN is an extended Siamese neural network that combines appearance similarity with temporal geometric information and efficiently learns both visual and geometric features during end-to-end training. The system can be divided into two steps: 1) ESNN-based similarity mapping and 2) a matching algorithm. A Siamese network, referred to as the 'Base Network', is pre-trained on visual similarity. Then, the ESNN takes Intersection-over-Union (IoU) and area ratio information from pairs of objects as additional features, and builds a combined similarity mapping with both the geometric and pre-trained Siamese network features. In this way, the ESNN can take temporal information into account without degrading the overall speed. After the ESNN is fully trained and similarity scores are computed, the matching algorithm produces final tracking results. We also propose a novel heuristic matching algorithm and demonstrate that it performs as well as the widely used Hungarian algorithm at higher speed.
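The geometric features fed to the ESNN can be sketched in plain Python; note that the exact normalization of the area ratio is an assumption here (smaller box over larger box):

```python
# Boxes as (x1, y1, x2, y2); these geometric quantities are the extra
# inputs the ESNN combines with appearance similarity.
def iou_and_area_ratio(a, b):
    # Intersection rectangle (clipped to zero if the boxes don't overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter)
    ratio = min(area_a, area_b) / max(area_a, area_b)  # assumed form
    return iou, ratio
```

For a pair of detections in consecutive frames, high IoU and an area ratio near 1 indicate little motion or scale change, complementing the appearance score.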

The ESNN-based tracking system achieves competitive performance in both speed and accuracy on the MOT16 (Multiple Object Tracking Benchmark 2016) challenge, compared to known state-of-the-art methods. Evaluation results show that a Siamese neural network with incorporated temporal information has significant potential for building similarity matrices for multi-object tracking. Last but not least, the system is online (zero latency), which is key for real-time applications such as ADAS and autonomous driving.
Patricia Ordóñez and Abiel Roche-Lima. Classification of Physiological Data for Intelligent Decision-aid Tools for Intensive Care Units
Abstract: Current methods for measuring the well-being of a patient in the intensive care unit (ICU) involve acquisition of the patient's vital signs data. This data is stored in an electronic health record after validation by a medical provider, at a rate of once every 15 minutes to once every several hours, and physicians make life-saving decisions based on this data. The importance of analyzing this data as a multivariate temporal representation, whether by creating multivariate probabilistic models or temporal abstractions from electronic health records, or by creating multivariate structures similar to those in other domains such as convolutional neural networks or imaging, is well recognized.

Edit distance has been applied to compute similarities for pairs of strings, and by converting physiological time series data to Symbolic Aggregate approXimation (SAX), we suspected that it could improve the efficiency of classification. Edit distance is defined as the minimum number of operations, i.e. insertions, deletions, and substitutions, required to transform the input into the output. The costs of insertions, deletions and substitutions are normally predefined. If the transformations are based on an arbitrary random phenomenon, the costs of operations become random variables and the edit distance becomes stochastic. The calculation of stochastic edit distance can be modeled as a stochastic transduction and represented by a stochastic finite-state transducer. Using this model, a probability can be assigned to each new pair of strings, which has proven very useful for tackling sequence problems such as segmentation, DNA alignment, and classification. In this project, we propose to learn the stochastic edit distance of the sequences obtained from the physiological data using a stochastic finite-state transducer.
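A compact sketch of the SAX conversion step, assuming an alphabet of size four with the standard Gaussian breakpoints; the segment count and alphabet size are illustrative choices, not the ones used in this project:

```python
import numpy as np

def sax(series, n_segments=8, breakpoints=(-0.6745, 0.0, 0.6745)):
    """Convert a numeric time series to a SAX string (alphabet size 4).

    The series length must be divisible by n_segments. Breakpoints are
    the quartiles of the standard normal, the usual SAX choice for a=4.
    """
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()                  # z-normalize
    paa = x.reshape(n_segments, -1).mean(axis=1)  # piecewise aggregate means
    symbols = np.searchsorted(breakpoints, paa)   # bin each mean into a letter
    return "".join("abcd"[s] for s in symbols)
```

The resulting strings are what the (stochastic) edit distance then compares across patients.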

Another machine learning approach for classification of multivariate time series is the Support Vector Machine combined with Rational Kernels. Rational Kernels, based on Finite-State Transducers (FST), are the most common kernels to manipulate sequence data, which can also be considered strings. N-gram rational kernels are defined as similarity measures between sequences based on the number of equal subsequences they share, where sequences are represented by finite-state transducers. We propose to use n-gram rational kernels for strings on novel multivariate symbolic representations of physiological data and train a model based on an SVM to predict patient clinical status in terms of hypotension in the ICUs. To our knowledge, while SVMs have been used to classify physiological data, no methods have used finite-state transducers or SVM with rational kernels with a symbolic representation of multivariate time series.
Maryamossadat Aghili and Ruogu Fang. Towards High-Throughput Abnormal Brain Screening in MRI Images
Abstract: Automatic disease detection from MRI brain images has attracted significant research interest in recent decades. However, most research focuses on the detection of a particular brain disorder, and the majority of publicly available brain MRI datasets are composed of two classes: normal brain images as negative samples and brains affected by a specific disease as positive samples. For instance, The Cancer Genome Atlas (TCGA) has datasets of healthy and cancerous organs, and the OASIS dataset is composed of brain images with Alzheimer's disease and normal brains. In spite of increasing efforts in brain research, the importance of a general-purpose system for normal brain screening has been overlooked; such a system could be highly advantageous for low-income countries or any other setting where access to a radiologist is limited.
One of the main challenges in creating an automatic system for normal brain detection by classification of MRI brain images is the infrequent nature of some brain disorders. This is unlike many classification problems, which have enough samples of all classes at training time. It is relatively easy to collect enough normal brain images, but positive data for different types of brain diseases is hard to obtain. The training dataset is therefore partial, and a semi-supervised approach should be adopted to tackle this problem.
This paper proposes an automatic abnormality detection system based on brain MRI images. First, thousands of healthy brain images and comparatively limited sets of disordered brain images (e.g., tumor, Alzheimer's, trauma and stroke MRIs) are pooled together from several sources, probably prepared under different protocols. To redress the incompatibility and discrepancy in the amalgamated dataset, several preprocessing steps are performed before classification. Then an adaptively weighted one-class SVM, coupled with a deep learning network and a super-resolution technique, is applied to the standardized dataset.
The presented model is capable of screening healthy samples with relatively high confidence. If a sample image is identified as abnormal or suspicious, the patient is referred to a radiologist. The performance of the proposed algorithm shows a promising direction toward a fully automated healthy brain screening system based on MRI image analysis.
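The one-class screening idea above can be sketched with scikit-learn's OneClassSVM, trained only on "healthy" samples. The features are synthetic stand-ins for MRI-derived descriptors, and the adaptive weighting, deep network, and super-resolution steps of the actual system are omitted:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Stand-in feature vectors (e.g., embeddings of MRI scans): train only
# on healthy scans, then flag outliers as "refer to a radiologist".
healthy = rng.normal(0, 1, (500, 16))
suspicious = rng.normal(4, 1, (20, 16))  # synthetic out-of-distribution scans

# nu bounds the fraction of training points treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(healthy)
flags = ocsvm.predict(suspicious)  # -1 = abnormal, +1 = normal
```

Because the classifier never needs labeled disease examples, rare disorders it has never seen can still be flagged as abnormal.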
Spandana Gella and Margaret Mitchell. Residual Multiple Instance Learning for Visually Impaired Image Descriptions
Abstract: Vision-to-language (VL) research has reached a critical moment, where significant improvements can be made by focusing on the architecture and inference techniques of VL deep learning models. Work in this area has led to large gains in visual object naming [2], and accounting for different aspects of visual language [1].

We push this research direction further, introducing a learning architecture that permits rapid expansion of the vision-language vocabulary. We focus on a vocabulary increase specifically for the millions of people who are blind or have severe visual impairments: image-centric technology, including news and social media, is not accessible. Some content is embedded with human-written descriptions (alt-text), but these are of varying quality and availability. The time is ripe for high-quality automatic image descriptions.

Recent advances in computer vision and natural language processing have led to many state-of-the-art VL systems. However, these are modeled on datasets created by sighted users, to evaluate AI systems on the literal context present in the image: objects/people, attributes (colors/size), location, and actions. Existing image description systems currently cannot describe emotional content present in the image well, nor can they convey highly conceptual image descriptions. Such capability is critical to accurately convey both news and personal images.

This paper makes two novel contributions: 1) a VL deep learning architecture that jointly learns concrete and abstract concepts; 2) a vision-language dataset and vocabulary built in light of the stated preferences of people in the visually impaired community. Combined, we introduce the first image-description method that can reliably generate words like celebration, event, beautiful, and sadness. The increased vocabulary capabilities of 1K words are specifically useful for people who are severely visually impaired.

The model we propose identifies abstract/conceptual concepts in an image by leveraging literal/concrete concepts that are easily identifiable. By combining deep residual learning with Noisy-Or Multiple Instance Learning, the proposed architecture can easily learn joint representations of concretely recognized visual concepts and external abstract text-based concepts. This drives the prediction of conceptual words from the concretely recognized concepts in the image. The learned representations serve as input to a gated recurrent neural network, which can then generate more expressive and abstract content than has been captured previously.

The proposed approach serves as a significant advance in what multimodal neural networks can model, capturing conceptual aspects of images that until now have remained unseen.

[1] Misra, I., Zitnick, C. L., Mitchell, M., and Girshick, R. (2016). Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR.

[2] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., and Li, F. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.
Subarna Tripathi, Zachary C Lipton, Serge Belongie and Truong Nguyen. Context Matters: Refining Object Detection in Video with Recurrent Neural Networks
Abstract: Given the vast amounts of video available online and recent breakthroughs in object detection with static images, object detection in video offers a promising new frontier. However, motion blur and compression artifacts cause substantial frame-level variability, even in videos that appear smooth to the eye. Additionally, in video datasets, frames are typically sparsely annotated. We present a new framework for improving object detection in videos that captures temporal context and encourages consistency of predictions. First, we train a pseudo-labeler, i.e., a domain-adapted convolutional neural network for object detection, on the subset of labeled frames. We then apply it to provisionally label all frames, including those without labels. Finally, we train a recurrent neural network with Gated Recurrent Units (GRUs) that takes as input sequences of pseudo-labeled frames and optimizes an objective that encourages both accuracy on the target frame and consistency across consecutive frames. The approach incorporates strong supervision of target frames, weak supervision on context frames, and regularization via a smoothness penalty.
The strong supervision on the target frames uses a multi-part object detection loss with both classification and localization ground truth. The context frames, which lack ground truth, receive only object-category-level weak supervision, bypassing localization even when multiple objects are present. The regularizer encourages prediction consistency by minimizing the Euclidean loss between predictions at consecutive frame pairs. Additionally, a soft constraint minimizing the difference between the pseudo-labels and the prediction at every video frame completes the training strategy for the RNN.
Our approach achieves a mean Average Precision (mAP) of 68.73, an improvement of 7.1 over the strongest image-based baseline on the YouTube-Objects dataset. Our experiments demonstrate that neighboring frames can provide valuable information, even absent labels.
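The three training signals described above can be sketched as one combined objective. The minimal numpy sketch below is illustrative: the function names, weighting coefficients, and toy inputs are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def smoothness_penalty(preds):
    """Euclidean (squared L2) loss between predictions at consecutive frames."""
    diffs = preds[1:] - preds[:-1]
    return float(np.sum(diffs ** 2))

def weak_category_loss(pred_probs, present_classes):
    """Category-level loss for unlabeled context frames: penalize low
    probability for classes pseudo-labeled as present (localization bypassed)."""
    eps = 1e-12
    return float(-np.sum(np.log(pred_probs[present_classes] + eps)))

def combined_loss(target_loss, context_probs, present, preds,
                  w_weak=0.5, w_smooth=0.1):
    """target_loss stands in for the multi-part detection loss on the
    labeled target frame; the weights w_weak and w_smooth are assumed."""
    return (target_loss
            + w_weak * weak_category_loss(context_probs, present)
            + w_smooth * smoothness_penalty(preds))

# Toy predictions over 3 consecutive frames, 2 classes
preds = np.array([[0.2, 0.8], [0.25, 0.75], [0.3, 0.7]])
probs = np.array([0.1, 0.9])          # context-frame class probabilities
loss = combined_loss(1.0, probs, [1], preds)
```

In the actual system these terms would be computed from the GRU's per-frame outputs and backpropagated jointly.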

The full text of this work is available [4].

[1] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. “You Only Look Once: Unified, Real-Time Object Detection”. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788.

[2] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. “On the properties of neural machine translation: Encoder-decoder approaches”. In Proc. Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.

[3] Alessandro Prest, Vicky Kalogeiton, Christian Leistner, Javier Civera, Cordelia Schmid, and Vittorio Ferrari. Youtube-objects dataset v2.0, 2014.
URL calvin.inf.ed.ac.uk/datasets/youtube-objects-dataset. University of Edinburgh (CALVIN), INRIA Grenoble (LEAR), ETH Zurich (CALVIN).

[4] Subarna Tripathi, Zachary C. Lipton, Serge Belongie, and Truong Nguyen. “Context Matters: Refining Object Detection in Video with Recurrent Neural Networks”. In Proc. 27th British Machine Vision Conference (BMVC), 2016. https://arxiv.org/abs/1607.04648
Eliana Lorch. Spectral Decomposition of Deep Network Optimization Trajectories
Abstract: Deep neural networks can be trained to perform highly complex tasks beyond the abilities of classical algorithms, but the training process itself—the path that stochastic optimizers take through parameter space, from an initial randomized network to an effectively trained one—is not very well understood. In convex optimization, there is a strong theoretical understanding of the path that different optimization methods take and how fast they converge to the global optimum [1], but deep neural network loss functions are very non-convex.

Geometric intuition might reveal aspects of optimization in practice that could help improve our training techniques or inspire theoretical investigations of particular phenomena. High-dimensional optimization processes are rarely visualized because of the difficulty of representing high-dimensional dynamics in a visually intelligible way. The ubiquitous loss-over-time graph summarizes the effectiveness of training, but doesn’t show dynamics in parameter space.

We present a simple method for rendering the optimization trajectories of deep networks with low-dimensional plots, using linear projections of the spectral decomposition (obtained via principal component analysis). This preserves the trajectory's shape in order to observe patterns that are meaningful in the full parameter space. We show that such plots reveal visually distinctive properties of the training processes, and outline opportunities for future investigation.
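A minimal sketch of this kind of visualization, assuming flattened parameter snapshots are collected at each training step; the PCA here is computed via an SVD of the centered snapshot matrix, and the trajectory is synthetic:

```python
import numpy as np

def pca_trajectory(snapshots, k=2):
    """Project a (steps x params) matrix of parameter snapshots onto its
    top-k principal components, preserving the trajectory's shape."""
    X = np.asarray(snapshots, dtype=float)
    Xc = X - X.mean(axis=0)            # center across time steps
    # SVD of the centered matrix: rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T               # (steps x k) low-dimensional path

# Toy trajectory: a noisy drift through a 50-dimensional parameter space
rng = np.random.default_rng(0)
steps = np.cumsum(rng.normal(size=(200, 50)) * 0.01 + 0.02, axis=0)
path2d = pca_trajectory(steps, k=2)
```

The projected path can then be plotted (e.g., with matplotlib) to inspect smoothness, sharp turns, and oscillations along individual component axes.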

We performed exploratory experiments on simple CIFAR-10 and MNIST models trained with SGD and Adam, with various hyperparameters, and observed a number of patterns using our method. For example, we see that, unsurprisingly, increasing momentum has the effect of smoothing the optimization trajectory and discouraging sharp turns. Our experiments also revealed that under most conditions, the optimization trajectory oscillates along many of the principal component axes, which supports a broad practical interpretation of earlier theoretical work [2] showing gradient descent with momentum to be equivalent in the limit to coupled damped harmonic oscillators. Another surprising observation is that the strongest principal components tend to show predictable behavior of the trajectories (with the trajectory along the first principal component axis being monotonic in cases where training succeeds).

Contributions: Our primary contribution is a simple method for visualizing training trajectories using PCA, which can be used as a tool to investigate the training dynamics of specific networks (as an adjunct to loss-over-time graphs) in pursuit of greater understanding and performance. Secondarily, we note some interesting observations using this method, some of which are potential opportunities for deeper understanding of the training process.

Note: this work was presented at the 2016 ICML visualization workshop (full text with images available at http://icmlviz.github.io/assets/papers/24.pdf).

References (fuller list available in paper above):
[1] Bertsekas, D. P. Incremental gradient, subgradient, and proximal methods for convex optimization. In Optimization for Machine Learning, pp. 1–38. MIT Press, 2011.
[2] Qian, N. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.
[3] Goodfellow, I. J., Vinyals, O., and Saxe, A. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.
Sandy Huang, David Held, Pieter Abbeel and Anca Dragan. Enabling Robots to Communicate what They've Learned
Abstract: Despite its importance in human-robot interaction, a robot's behavior is often difficult for non-expert users to predict. Motivated by this, our work focuses on increasing transparency regarding a robot's capabilities and preferences for acting in the world. Understanding a robot's capabilities and preferences is key to anticipating how the robot will act in a new situation, verifying what the robot has learned, and deciding whether to trust the robot.

Our insight is that the robot can provide examples of how it acts, to communicate what its current capabilities or preferences are. To determine which examples to provide, we develop a model of human users that captures how viewing examples of robot executions impacts the human's understanding of that robot's capabilities and preferences. We then leverage this model to select a sequence of examples that optimally updates the human's understanding and brings it closer to the ground truth. Our work builds on algorithmic teaching, which focuses on finding the optimal set of training examples to teach a learner a given concept. In contrast to previous work, we focus on teaching human users rather than algorithms, with robot trajectories as training examples.

We frame communicating robot capabilities as teaching humans a classification task: predicting whether or not the robot will succeed on a new task variation. We find that a Generalized Context Model best captures how humans reason about robot capabilities based on the successes and failures they see. In our between-subjects user study on Amazon Mechanical Turk, participants shown examples based on our approach more quickly and accurately learned the robot's capabilities, even compared to participants who selected examples themselves.

Communicating a robot's preferences requires a different model. Formally, a robot typically acts to maximize its reward function, in expectation over some horizon; this is what drives the robot's behavior. Thus, our goal is to communicate this reward function to human users. This is closely related to Inverse Reinforcement Learning (IRL), the process of learning a distribution over reward functions from example behavior. Prior work applied algorithmic teaching to IRL in the context of teaching machines, and assumes exact inference [1]. Instead, we model humans as performing noisy Bayesian inference over the space of possible reward parameters, and show that selecting the optimal sequence of examples based on this model performs significantly better than doing so based on the exact-inference IRL model (Welch's t-test, p = 0.01).

Our contribution is an approach for efficiently communicating robot capabilities and preferences to humans, as well as modeling human learning in the domain of robot trajectories. This results in end-users developing accurate mental models of the robot more quickly, which has two benefits. First, after a user understands a robot's capabilities, the user can better provide additional demonstrations to refine the reward function and address failure modes. Second, users will be able to better predict what the robot will do in new environments, enabling safer interaction and increasing trust.

[1] M. Cakmak and M. Lopes. Algorithmic and human teaching of sequential decision tasks. AAAI 2012.
Yao Qin, Dongjin Song, Haifeng Chen, Geoff Jiang and Garrison Cottrell. Attention-based Recurrent Neural Network for Time Series Prediction
Abstract: Multivariate time series prediction, which forecasts the current value of a time series based upon its previous values as well as the current and past values of multiple driving (exogenous) series, has been studied for decades. Despite the fact that various models have been developed for time series prediction, few of them can capture the long-term temporal dependencies appropriately and select the relevant driving series to make prediction.

To address the problem of long-term dependencies, Long Short-Term Memory (LSTM) based encoder-decoder neural networks have been proposed and have achieved great success in natural language processing and image processing. Based upon this framework, an attention mechanism has been introduced in the decoder to mimic the human attention system; it can adaptively focus on the relevant hidden states of the encoder and thus improves prediction performance.
However, some theories of human attention [1] posit a two-stage mechanism, in which the early stage selects elementary stimulus features and the second stage decides between target and non-target and rapidly drives the decision process. The existing attention mechanism in the decoder can only select the temporal components of each driving series (i.e., input features) with respect to the output, but cannot select among the driving series themselves.

In light of this, we propose a dual-stage attention-based recurrent neural network for multivariate time series prediction. In the first stage, we develop a new attention mechanism to adaptively extract the relevant driving series at each timestamp by referring to the previous encoder hidden state. This early-stage attention relieves the encoder of the burden of encoding all the information. In the second stage, we use the attention mechanism in the decoder to simultaneously select relevant input hidden states across all previous timestamps, conditioned on the previous decoder hidden state. The two attention models can be jointly trained with standard backpropagation. With this new attention scheme and the Long Short-Term Memory (LSTM) structure, our model can not only appropriately capture the long-term temporal dependencies of a time series, but also adaptively select the most relevant input features, reducing the input dimension and improving prediction performance. To demonstrate the effectiveness of the proposed algorithm, we collected and constructed two real-world datasets, a Power Plant Sensor dataset and a NASDAQ 100 Stock dataset (which will be made publicly available), with a large number of driving series over long time spans. On three different evaluation metrics, i.e., MAE, RMSE, and MAPE, our model outperforms classic time series prediction methods such as ARIMA and NARX neural networks by a large margin. In addition, we compared our dual-stage attention-based recurrent neural network with state-of-the-art attention-based neural networks, and it also achieves better prediction results.
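The first-stage (input) attention can be sketched as a softmax over per-series relevance scores computed from the previous encoder hidden state. The scoring function and dimensions below are illustrative assumptions, not the authors' exact parameterization:

```python
import numpy as np

def input_attention(h_prev, x_t, W, U):
    """Weight each driving series at time t by its relevance to the previous
    encoder hidden state, then rescale the input accordingly."""
    # One relevance score per driving series, from the hidden state and
    # that series' current value (assumed linear scoring for simplicity)
    scores = W @ h_prev + U * x_t
    e = np.exp(scores - scores.max())      # numerically stable softmax
    alpha = e / e.sum()
    return alpha * x_t, alpha              # attended input, attention weights

n_series, hidden = 4, 8
rng = np.random.default_rng(1)
h = rng.normal(size=hidden)                # previous encoder hidden state
x = rng.normal(size=n_series)              # current values of driving series
W = rng.normal(size=(n_series, hidden))    # assumed scoring parameters
U = rng.normal(size=n_series)
x_att, alpha = input_attention(h, x, W, U)
```

The attended input x_att would then be fed to the LSTM encoder at time t; the second-stage (temporal) attention in the decoder follows the standard encoder-decoder attention pattern.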

[1] Hübner, R., Steinhauser, M., and Lehle, C. (2010). A dual-stage two-phase model of selective attention. Psychological Review, 117(3):759.
Lina Lin, Mathias Drton and Ali Shojaie. Estimation of high-dimensional graphical models using regularized score matching
Abstract: Undirected graphical models, also known as Markov random fields, are important tools for summarizing dependency relationships between random variables and have found application in many fields, including bioinformatics, language and speech processing, and digital communications. Each such model is associated with an undirected graph G. For a random vector X = (X_j) indexed by the nodes of G, the graphical model given by G requires that X_j and X_k be conditionally independent given all other variables whenever nodes j and k are not joined by an edge in G (Lauritzen, 1996). If G is the smallest graph such that X satisfies this requirement, we term G the conditional independence graph of X. In this case, X_j and X_k are conditionally independent given all other variables if and only if j and k are non-adjacent in G. Specific models can be obtained from additional distributional assumptions. In particular, an assumption of multivariate normality gives Gaussian graphical models, for which estimation of conditional independence graphs is equivalent to covariance selection (Dempster, 1972).

Literature on selection of conditional independence graphs spans several decades. More recently, attention has shifted to high-dimensional settings with the dimension of X being comparable to or larger than the sample size n. In these settings, the selection problem only remains tractable under additional structural assumptions. In line with such assumptions, a substantial number of methods have been developed for both Gaussian and non-Gaussian settings.

Here, we introduce a novel method for estimating undirected conditional independence graphs based on the score matching loss, introduced by Hyvarinen (2005) and subsequently extended in Hyvarinen (2007). The regularized score matching method we propose applies to settings with continuous observations and allows for computationally efficient treatment of possibly non-Gaussian exponential family models. In the well-explored Gaussian setting, regularized score matching avoids the issues of asymmetry that arise when applying the technique of neighborhood selection; compared to existing methods that directly yield symmetric estimates, the score matching approach has the advantage that the considered loss is quadratic and gives piecewise linear solution paths under l1 regularization. The latter statement holds for many non-Gaussian settings as well. In other words, with regularized score matching, the problem of graph estimation boils down to a lasso-like regression problem when using an l1 penalty, which we can solve efficiently.

Under suitable irrepresentability conditions, we show that l1-regularized score matching is consistent for graph estimation in sparse high-dimensional settings. Through numerical experiments and an application to RNAseq data, we confirm that regularized score matching achieves state-of-the-art performance in the Gaussian case and provides a valuable tool for computationally efficient estimation in non-Gaussian graphical models.
Yulia Rubanova, Jeff Wintersinger, Amit Deshwar, Nil Sahin and Quaid Morris. Topic modelling through time to track tumour evolution
Abstract: Cancer is caused by mutations in our DNA that accumulate throughout a lifetime. Cancerous cells accumulate thousands of mutations due to external (smoking, UV light) or internal (copy errors, failure of DNA damage repair) causes. Multiple mutational sources can be active in the same tumour. Each source creates a unique distribution over types of new mutations, known as a mutational signature. The rates at which new mutations are generated can vary over time and are known as signature exposures. In machine learning terms, mutation signatures correspond to topics in a topic model over a vocabulary of mutation types, and the mixing coefficients represent exposures.

Most cancers evolve in stages. Transitions between stages correspond to the gain of additional survival advantages, such as changing a patient's responsiveness to treatment or causing the tumor to evolve to a more aggressive state [Martin, Clinical Cancer Research, 2008]. Our goal is to find change points in signature exposures that are linked to transitions in tumor stage.

We have developed a variant of topic modelling over time on the mutation types to track changes in signature exposures. First, we sort mutations by the relative time of their occurrence in the tumor and divide them into bins of one hundred. Each bin corresponds to one time point, and we fit a mixture of multinomials in each bin. Our vocabulary consists of 96 mutation types. Signatures (topics) are represented as multinomial distributions over the 96 mutation types and are those derived by Alexandrov [Nature, 2013]. The derived mixture coefficients correspond to the exposures of the mutational signatures that generated the mutations in each bin. Finally, we evaluate uncertainty by bootstrapping the set of mutations and recomputing the exposure estimates.

To find a change point where signature profiles change substantially, we iterate through all time steps and recompute the mixtures of multinomials in the time slices formed by each potential change point. The point with maximum likelihood is chosen as a new change point, and we use the Bayesian Information Criterion to estimate the optimal number of change points.
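The change-point search can be sketched as follows, assuming each time bin is summarized by counts over mutation types and each candidate segment is scored by its pooled multinomial maximum-likelihood fit. This single-change-point version and the toy data are illustrative simplifications of the described procedure:

```python
import numpy as np

def segment_loglik(counts):
    """Multinomial log-likelihood of binned counts under the segment's
    pooled MLE category probabilities (constant terms dropped)."""
    totals = counts.sum(axis=0)
    p = totals / totals.sum()
    mask = totals > 0
    return float((totals[mask] * np.log(p[mask])).sum())

def best_change_point(counts):
    """Try every split between time bins; return the split that maximizes
    the summed log-likelihood of the two resulting segments."""
    best_t, best_ll = None, -np.inf
    for t in range(1, len(counts)):
        ll = segment_loglik(counts[:t]) + segment_loglik(counts[t:])
        if ll > best_ll:
            best_t, best_ll = t, ll
    return best_t, best_ll

# Toy data: 10 bins over 4 mutation types; exposures shift after bin 5
rng = np.random.default_rng(2)
early = rng.multinomial(100, [0.7, 0.1, 0.1, 0.1], size=5)
late = rng.multinomial(100, [0.1, 0.1, 0.1, 0.7], size=5)
counts = np.vstack([early, late])
t_star, ll = best_change_point(counts)
```

Repeating this search recursively on the resulting segments, and stopping when BIC no longer improves, yields the multi-change-point version described above.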

We applied our approach to 1681 samples from 41 cancer types. Our results show that breast cancer, pancreatic cancer, and leukemia have the highest fraction of tumor samples with sharp changes in exposures. Over 30% of samples in these cancer types have a signature that changes by more than 20% of the overall exposure.

Over time, tumors can give rise to new subclones that are genetically distinct from the original tumor. We use PhyloWGS [Deshwar, Genome Biology, 2015] to separate mutations into subclones, and we compared the subclone mutational groups from PhyloWGS to the mutational groups separated by our change points. The changes in signature exposures detected by our method occur at approximately the same pseudo-time as a new subclone appears in the tumor.

Our study enables interesting further directions. Estimating the future behaviour of signatures allows us to predict the occurrence of new subclones in a patient. Another possible extension is to derive signatures that depend on local genomic features and predict the locations of future mutations. As mutations determine a patient's response to treatment, our approach enables suggesting a treatment strategy based on the active signatures and the genetic background of the patient.
Sindhu Ghanta, Jennifer Dy, Kivanc Kose, Dana Brooks, Milind Rajadhyaksha and Michael Jordan. A Generative Latent Space and Shape Model for 3D Segmentation of the Dermis/Epidermis in Reflectance Confocal Microscopy Images
Abstract: Segmenting objects of interest from 3D datasets is a common problem encountered in medical data. Subtle changes in intensity and low contrast make the task of segmentation difficult. Taking shape information about the object of interest into account can often improve segmentation performance. However, the shapes of objects in biology and medicine are highly variable, and in many situations we are also interested in knowing the number of objects present [1]. We propose a marked spatial Poisson process model that takes uncertainty in location, shape, and appearance into account for automatically detecting the dermis-epidermis junction (DEJ) in 3D reflectance confocal microscopy (RCM) images of human skin.

RCM imaging allows clinicians to view nuclear detail in skin without resorting to biopsy, and the DEJ is an important morphological feature because this is where cancer usually starts. Detecting the DEJ is challenging because RCM images have low contrast and the DEJ structure is highly variable and complex [2]. This structure is composed of a highly variable number of irregularly spaced and variably shaped "peaks and valleys". Therefore, a prior model needs to incorporate this complex structure while allowing variability in its parameters.

We propose a generative model in which the DEJ is represented as a union of an unknown number of "hill" shapes emerging at random locations, using a marked Poisson process. In human skin, these hill tops are very sparse relative to the 3D volume, which motivates the Poisson prior. The model has three flexible components: a random number of objects (hill tops) at random locations, a parameterized 3D shape model, and a parameterized appearance model. The 3D shape model forms a hill structure with parameters that capture variation in the height, orientation, and size of the hill. Appearance is modeled using a multinomial distribution whose parameters are learnt adaptively for each sample.

Inference in models where the number of parameters changes is typically performed using Reversible-Jump Markov chain Monte Carlo (RJMCMC) sampling [3]. We develop a Gibbs sampling strategy that handles the changing model order and infers the model parameters. Unlike RJMCMC, no proposed jumps are rejected, which ensures faster convergence. Experimental results show that the proposed unsupervised model automatically detects the DEJ with physiologically relevant accuracy, in the range of 10-20 micrometers.

[1] C. O. de Solorzano, E. G. Rodriguez, A. Jones, D. Pinkel, J. W. Gray, D. Sudar, and S. J. Lockett, “Segmentation of confocal microscope images of cell nuclei in thick tissue sections,” Journal of Microscopy, vol. Pt 3, no. 193, pp. 212–226, 1999.
[2] S. Kurugol, K. Kose, B. Park, J. G. Dy, D. H. Brooks, and M. Rajadhyaksha, “Automated delineation of dermal-epidermal junction in reflectance confocal microscopy image stacks of human skin,” Journal of Investigative Dermatology, vol. 135, no. 3, pp. 710–717, 2014.
[3] R. M. Neal, “Markov chain sampling methods for Dirichlet process mixture models,” Journal of Computational and Graphical Statistics, vol. 9, no. 2, pp. 249–265, 2000.
Vidhya Murali. Music Beyond Genres
Abstract: Representations learnt for items and users from collaborative-filtering models see useful applications in recommendation and personalization. However, they suffer from the well-known cold-start problem. Another potential risk is that the model fails to capture seasonality-specific patterns. Moreover, these latent factor representations do not have a one-to-one mapping to real-world features, so it is not easy to tag or search for items/users based on these feature values.

There are many content-based signals available for music, such as artist, album, label, release date, contextual tags, keywords mined from web text, lyrics, description data, and audio signals. In this talk, we discuss our findings from training deep neural networks for multi-label classification of songs into tags such as artist, genre, and context tags (e.g., running, focus, breakup, happy, motivation, workout) using audio signals as input.
This has applications in auto-tagging the music in our catalogs and can potentially help human curators in the candidate generation process.
The embeddings for songs learnt from this model may also prove useful for search (e.g., searching music by context, with sample queries such as "music for a rainy day"), recommendations (addressing cold-start problems as well as content-based recommendations), and personalization. It may also help organize music beyond genres.
Nadia M. Ady and Patrick M. Pilarski. Domains for Investigating Curious Behaviour in Reinforcement Learning Agents
Abstract: From reducing human wait-time on our desktop computers to controlling prosthetic limbs, intelligent systems are rapidly improving to the point where they can have significant benefits to quality of life. Historically, we have developed systems whose operating procedures are designated by their human designers. But the reality is that, in many situations, a system designer cannot be expected to determine the best way to operate in every environment the system could encounter. Rather, the system itself is in the ideal position to determine its own capabilities.

We can observe humans discovering their own capabilities---we see human infants testing their own motor functions and refining these functions into skills to apply in different situations. This intrinsic drive to discover may be called curiosity. Mechanisms to give computational systems a desire to learn or know more are called computational curiosity.

Many researchers have posited that we might generate curious behaviour in machines through the use of reinforcement learning (RL). A number of methods have been developed which assess how curious a behaviour is and then use those assessments to modify the reward delivered to a learner or to modify other parts of an RL algorithm. These methods have shown promise in a number of real-world and simulated targeted domains. However, at present, there is no unified way to compare different curious behaviours produced by these different approaches. Further work is needed to create a coherent understanding of the existing methods for computational curiosity and their underlying principles.

In this work, we describe our initial steps in assessing how different approaches to curiosity in RL lead to different behaviours within a controlled experimental domain. Results from the study of curiosity in psychology have suggested that curious behaviour may be stimulated by interesting variations in the environment. As the principal contribution of this work, we introduce a domain which allows us to investigate the behaviour elicited by different computational curiosity approaches. In this domain, the agent is presented with a single state. Each action available to the agent will result in different variations in the reward externally presented by the environment. We also describe a second new domain that holds reward constant and instead allows us to examine curious behaviours with respect to changes in state, or state-action trajectories. Using these domains, we provide initial insight into how different approaches to computational curiosity drive agent behaviour.

Our results provide a first look at computational curiosity in a unified setting, and are therefore an important step toward a better understanding of curious behaviour in learning systems. We expect the principled understanding of computational curiosity will make significant contributions to the development of general machine intelligence, and provide new solutions to a wide range of real-world problems.
Smitha Milli, Falk Lieder and Tom Griffiths. Optimal Cognitive Systems for Bounded-Rational Metareasoning
Abstract: While optimal metareasoning is notoriously intractable, humans are nonetheless able to adaptively allocate their computational resources. A possible approximation that humans may use to do this is to only metareason over a finite set of cognitive systems that perform variable amounts of computation. The highly influential ``dual-process" accounts of human cognition, which postulate the existence of a slow accurate system and a fast error-prone system, can be seen as a special case of this approximation. This raises two questions: how many cognitive systems should a bounded optimal agent be equipped with and what characteristics should those systems have? We investigate these questions in two settings: a one-shot decision between two alternatives, and planning under uncertainty in a Markov decision process. We find that the optimal number of systems depends on the variability of the environment and the costliness of metareasoning. Consistent with dual-process theories, we also find that when having two systems is optimal, then the first system is fast but error-prone and the second system is slow but accurate.
Erika Strandberg, Catherine Ross, Natalie Pageler and Mohsen Bayati. A new pediatric early warning score: predicting rare events using medical record data in real time.
Abstract: Lack of recognition of clinical deterioration in pediatric patients is associated with an increase in in-hospital mortality. Many of these patients show signs of deterioration in the 24-48 hours prior to emergent transfer to the intensive care unit (ICU) [1,2]. Pediatric early warning system scores, like bedsidePEWS (bPEWS), have been developed to predict emergent transfers and facilitate the deployment of groups of specially trained personnel, or “rapid response teams” (RRTs) [3,4]. Existing scores require manual calculation and feature entry every 30 minutes to 4 hours by nursing staff, which is often done retrospectively. Furthermore, these methods have all been developed to predict transfer at some point during a hospital stay rather than the probability of transfer in real time. As such, existing PEWS scores have a high number of false positives, and there is social pressure for clinical staff to delay activation of RRTs [5].

There is an opportunity to use the massive amount of data available in the electronic medical record (EMR) to design a real-time, data-driven predictive model for pediatric emergent transfers. Such a model must be able to perform feature selection, predict transfers at intervals during a hospital stay, and predict rare events. Furthermore, we hypothesize that modeling missing or incomplete data in EMRs will lead to better outcome prediction: the pattern of missingness in EMR data is not random, as many labs or vitals are absent depending on a patient's clinical condition and a physician's intuition.

We extracted complete laboratory results and vitals collected over the first 24 hours of hospital stay for pediatric patients at the Lucile Packard Children's Hospital between April 2015 and August 2016. Cases were patients who were transferred from the floor to the ICU and required a subsequent ICU-level intervention. We addressed the missing value problem by using matrix completion on continuous lab and vital measures and compared performance to data binarized as present/absent. For each case and control, we set a random time, t0, and built a model to predict transfers in the following 24 hours. We used logistic regression with a Lasso penalty to perform feature selection, due to its interpretability. We also used regularized gradient boosted methods (GBMs), which do not require complete data and allow for nonlinear feature effects. We assessed model performance using the area under the receiver operating characteristic curve (AUC) with 5-fold cross-validation, and further assessed it by comparing to the bPEWS score.
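The Lasso-penalized logistic regression step can be sketched as follows; scikit-learn is assumed for brevity, and the present/absent features and outcome here are synthetic stand-ins for the EMR data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for binarized present/absent lab and vital features
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(400, 30)).astype(float)
# Outcome depends on a small subset of features, mimicking sparse signal
logits = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] - 1.0
y = (rng.random(400) < 1 / (1 + np.exp(-logits))).astype(int)

# L1 penalty drives irrelevant coefficients to zero (feature selection)
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
model.fit(X, y)
selected = np.flatnonzero(model.coef_[0])   # indices of retained features
```

The retained features keep the model interpretable to clinicians, which is the stated reason for preferring the Lasso here over black-box alternatives.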

Our final model is capable of predicting pediatric emergent transfers in real time with minimal clinician input.
Lisa Wang, Angela Sy, Larry Liu and Chris Piech. What Will You Code Next? Deep Knowledge Tracing on Open-Ended Exercises
Erin LeDell. Scalable Ensemble Learning with H2O Ensemble
Abstract: Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. There is an implicit computational cost to using ensemble methods, since they require the training of multiple base learning algorithms. Therefore, practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. We present H2O Ensemble, an open source software package that reduces the computational burden of ensemble learning while retaining superior model performance.

H2O Ensemble is an implementation of the Super Learner ensemble algorithm that uses distributed base learning algorithms via the open source machine learning platform H2O. The core H2O algorithms are implemented in Java; however, APIs exist for R, Python, and Scala.

The Super Learner, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. There are a handful of supervised machine learning algorithms supported by H2O, all of which can be used as base learners for the ensemble. The following algorithms are currently supported: Generalized linear models (GLMs) with Elastic Net regularization, Gradient Boosting (GBM) with regression and classification trees, Random Forest, Deep Learning (multi-layer feed-forward neural networks), and Naive Bayes.
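The Super Learner (stacking) procedure described above can be sketched with scikit-learn stand-ins for the H2O base learners; the specific learners and metalearner below are illustrative choices, not the package's defaults:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

base_learners = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
]

# Level-one data: cross-validated predictions from each base learner,
# so the metalearner never sees a base learner's training-set fit
Z = np.column_stack([
    cross_val_predict(bl, X, y, cv=5, method="predict_proba")[:, 1]
    for bl in base_learners
])

# Metalearning: combine base-learner predictions with a secondary model
meta = LogisticRegression().fit(Z, y)
for bl in base_learners:   # refit base learners on all data for deployment
    bl.fit(X, y)
```

The point of H2O Ensemble is that both the cross-validated base-learner fits and the metalearning step run on H2O's distributed backend, which is what makes the approach scale beyond a single machine.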

H2O can be used to speed up machine learning problems on your laptop (as a local multicore cluster), or it can be used in a multi-node cluster setting (for example, on Amazon EC2). We present benchmarks for H2O Ensemble over a variety of cluster configurations on Amazon EC2. The benchmarks include results for a 32-core single-node workstation, and a 96-core and 320-core cluster.
Jianghong Shi, Eric Shea-Brown and Michael A. Buice. Comparing Representations between Mouse Visual Cortex and Deep Neural Networks
Abstract: Deep neural networks (DNNs) have been rejuvenated by the emergence of big datasets and the simultaneous increase in computational power. On many tasks, DNNs have reached or surpassed human-level performance, including the recent success in the game of Go [1]. These achievements make one wonder whether DNNs have, to some extent, captured the principles of the brain. Interestingly, recent studies (e.g., [2]) show that the task performance of a DNN predicts the extent to which it matches neural data. This suggests that performance-optimized DNN models have statistical properties similar to those of real neurons.
In order to assess these similarities, we use the newly-released Allen Brain Observatory dataset [3] from the Allen Institute for Brain Science. This unprecedented dataset is a survey of physiological activity across multiple regions, layers, and cell types in the visual system of awake and behaving mice. The mice are presented with a wide array of visual stimuli: gratings, locally sparse noise, natural images, and videos. Meanwhile, their brain activity is recorded using two-photon calcium imaging. This allows us to compare representations between the cortex and DNNs in a new and more detailed way, specifically how representations evolve across putative hierarchies.
In this work, we compare stimulus representations in mouse visual cortex with those of a pre-trained Google Inception Network (GIN) by regressing real neural responses in mouse visual cortex against features extracted from different layers of GIN. We perform linear regression with 5-fold cross-validation for four visual regions, namely VISp, VISal, VISl, and VISpm. We find that the GIN layers that best predict responses across visual areas are the early concatenation layers. Moreover, for VISp and VISpm the best layer is the second concatenation layer, while for VISl and VISal it is the third concatenation layer. This suggests that VISl and VISal could occupy a higher functional position than VISp and VISpm in the mouse visual pathway. Given that the functional hierarchy of regions in mouse visual cortex above the primary visual area is still unknown, our work sheds light on the properties and possible hierarchy of these regions.
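The layer-comparison analysis can be sketched as follows; this toy example uses random features and ridge regression (a regularized stand-in for plain linear regression), with a synthetic response constructed so that one hypothetical layer should score best under cross-validation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_stim = 200
# Hypothetical features from three network layers (random placeholders).
layer_feats = {f"layer_{i}": rng.normal(size=(n_stim, 64)) for i in range(3)}
# Synthetic "neural response" driven by layer_1, so layer_1 should win.
resp = layer_feats["layer_1"][:, :5].sum(axis=1) + 0.1 * rng.normal(size=n_stim)

# Cross-validated R^2 of regressing the response on each layer's features.
scores = {name: cross_val_score(Ridge(alpha=1.0), F, resp, cv=5).mean()
          for name, F in layer_feats.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```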

1. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
2. D. L. K. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014.
3. http://observatory.brain-map.org/visualcoding
Yali Wan and Marina Meila. Model free recovery guarantees in community detection
Abstract: In the last few years, model-based clustering in networks has witnessed spectacular progress. Central to this progress are the so-called block models: the Stochastic Block Model (SBM), the Degree-Corrected SBM (DC-SBM), and the Preference Frame Model (PFM). Understanding of these models has advanced, especially regarding the conditions under which recovery of the true clustering is possible with little or no error. Algorithms for recovery with guarantees have also been improved. However, the impact of these results is limited by the assumption that the observed data come from the model.

This paper proposes a framework for providing theoretical guarantees for the results of model-based clustering algorithms, without making any assumption about the data-generating process. To describe the idea, we need some notation. Assume that a graph G on n nodes is observed. A model-based algorithm clusters G, and outputs a clustering C and parameters M(G, C).

The framework is as follows: if M(G, C) fits the data G well, then we shall prove that any other clustering C’ of G that also fits G well will be a small perturbation of C. If this holds, then C with model parameters M(G, C) can be said to capture the data structure in a meaningful way.

We exemplify our approach by obtaining model-free guarantees for the SBM and PFM. To instantiate the framework, we define the following items for these two model classes: 1) a model construction, i.e., an algorithm for fitting a model in M to (G, C) to obtain M(G, C); 2) a goodness-of-fit measure between M(G, C) and the data G; and 3) a distance between clusterings. We then show that if M(G, C) fits G well and there exists another clustering C’ such that M(G, C’) fits G well, then the difference between C and C’ is upper bounded by a small perturbation. All quantities in our theorems are computable from the data and the clustering C, i.e., they do not depend on unavailable parameters.
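Schematically, the guarantee instantiated by these three ingredients takes the following form (the symbols here are illustrative, not the paper's exact notation):

```latex
% \hat{L} denotes a goodness-of-fit measure, d a distance between clusterings,
% and \delta(\varepsilon) a small quantity computable from G and C.
\hat{L}\big(M(G,C),\,G\big) \le \varepsilon
\;\;\text{and}\;\;
\hat{L}\big(M(G,C'),\,G\big) \le \varepsilon
\;\;\Longrightarrow\;\;
d(C, C') \le \delta(\varepsilon).
```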

Moreover, we show that model-free and model-based results are intimately connected. The techniques we use for model-free results are more direct and elementary, and can be used to improve the current bounds in model-based results. Conversely, with simple modifications, current model-based results can be transformed into model-free results using our framework.

This paper makes several contributions. At a high level, it poses the problem of model-free validation in the area of community detection in networks. For the first time, it connects the stability paradigm to model-based clustering by “turning around” the model-based recovery theorems into a model-free framework. At the technical level, we obtain several different, model-specific stability results, which bound the perturbation of a clustering by the perturbation of a model.
Cheng Zhi Huang. Music Autofill
Abstract: Inpainting is the task of reconstructing an image given a fragment. Prior work in image inpainting has shown that features learned through this task yield competitive or better results in a number of image classification, detection, and segmentation tasks. In this work we explore inpainting for music scores represented as images of note positions and durations (piano rolls). The challenge in polyphonic music is that, individually, each voice has to adhere to melodic constraints, while collectively the voices need to form a harmonic progression. The inpainting setup allows a model to learn rich and flexible conditional distributions that capture both melodic and harmonic interdependencies between voices.

Our model consists of a deep residual convolutional neural network that takes partial piano rolls as input and produces a probability distribution over completed piano rolls. Our approach can be used for conditioned music generation, such as completion of a partial score or variation on existing works. Unlike prior work in music generation, our approach is not constrained to produce notes in chronological order, and hence suffers much less from the accumulation of errors characteristic of chronological models.

We evaluate our method on four-voice Bach chorale completion and qualitatively analyze its behavior.
Nafisa Sharif. A Bayesian Optimisation Model for Activity Prediction in Drug Design
Abstract: We present a novel method of identifying compounds with desirable activity during virtual screening for drug discovery, using Gaussian processes and Bayesian optimisation.
To treat a disease, a drug needs to act at a 'target' in the body, such as a specific enzyme. Usually, the drug is required to inhibit a process at the target; the concentration of the drug required for 50% inhibition is defined as the IC50, or activity [1].
The IC50 is one of the key properties used to determine whether a compound will be an effective drug (its safety and distribution in the body must also be considered). As part of the drug discovery pipeline, thousands or millions of compounds are experimentally screened to identify promising candidates [2]. The whole process costs billions of dollars [3] and takes up to 15 years [2], leading to the rise of computational methods such as 'virtual screening', used to make the process cheaper and faster.
We have developed a preliminary Gaussian process model to predict the IC50 of compounds from their physicochemical properties, known as descriptors. These properties include the topological surface area of the compound, the number of a particular functional group (for example ketone groups) present in the compound and the number of valence electrons (electrons available for bonding) the compound has.
The model can be used for regression (continuous IC50 prediction), and as an extension it can classify compounds as active or inactive, given their predicted IC50 and an active/inactive threshold from the literature. A dataset of 1000 acetylcholinesterase inhibitors with 108 descriptors was used to test the model using the RBF kernel. The kernel hyperparameters were selected using Latin hypercube sampling. With 150 compounds in the test set, the model gives an r-squared value of 0.715 between actual and predicted IC50 values, and it classifies 84% of compounds correctly.
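A minimal sketch of the regression step, using scikit-learn's Gaussian process with an RBF kernel; note that scikit-learn tunes kernel hyperparameters by maximizing the marginal likelihood rather than by Latin hypercube sampling, and the data and activity threshold below are synthetic placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(2)
# Synthetic stand-in: 10 "descriptors" per compound, IC50-like target.
X = rng.normal(size=(120, 10))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=120)

# RBF kernel plus a white-noise term; hyperparameters are fit by
# maximizing the log marginal likelihood.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(),
                              normalize_y=True, random_state=0)
gp.fit(X[:100], y[:100])
mean, std = gp.predict(X[100:], return_std=True)

# Classify active/inactive against a hypothetical literature threshold.
threshold = 1.0
active = mean > threshold
```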
In parallel with the refinement and expansion of compound libraries in traditional drug discovery, Bayesian optimisation is used to select new compounds to update the model. Three different acquisition functions were tested, with the best performance exhibited by the probability of improvement (PI) acquisition function. With 100 rounds of optimisation resulting in 150 test compounds, the r-squared value for the test set was 0.702, and 88% of the test compounds were classified correctly.
For further work, we will implement hyperparameter optimisation by maximum likelihood, along with more advanced kernels. In particular, we aim to use a property known as the Tanimoto similarity: a measure of the similarity between the structures of a pair of molecules, based on how many functional groups the molecules share. We intend to use this measure in a kernel in place of the Euclidean distance between pairs of input points, thereby incorporating structural information into the model that is not provided by the descriptors.
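The Tanimoto similarity on binary fingerprints has a simple closed form; a small sketch (the fingerprints here are toy vectors, not real molecular fingerprints):

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprint vectors:
    |a AND b| / |a OR b| (defined as 0 when both vectors are all-zero)."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

fp1 = [1, 1, 0, 1, 0]
fp2 = [1, 0, 0, 1, 1]
print(tanimoto(fp1, fp2))  # 2 shared bits / 4 set bits = 0.5
```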

1. A. D. Rodrigues (2008), Drug-Drug Interactions.
2. J.P. Hughes, S. Rees et al (2011), British Journal of Pharmacology, 162(6):1239–1249.
3. J.A. DiMasi, H.G. Grabowski et al (2016), Journal of Health Economics, 47:20 – 33.
Yuanyuan Pao, James Murphy and Asif Haque. Estimation and Prediction for Vehicle Locations, Trajectories, and Travel Times
Abstract: Estimating and predicting vehicle locations, trajectories, and travel times are important problems in research and industry, with a plethora of algorithmic challenges as well as a potentially large impact on transportation networks. In the past, most of these map-based applications have focused primarily on route prediction and travel time estimation. With the advent of ride-sharing services, creating accurate mapping technologies has become increasingly important. While these services have made vast quantities of location data available to transportation network companies and researchers alike, this data derives from commodity GPS sensors in consumer mobile devices; as such, the data is of variable quality and can suffer from systematic biases beyond standard noise models.

Our dataset is unique, spanning millions of rides and users across most metropolitan areas in the U.S. Typical methods of predicting vehicular movements leverage the regularity of driver behavior within the dataset, but our data comprise highly variable routes dependent on time, ride type, and traffic conditions. Additionally, ride-sharing has introduced the challenge of requiring models that accommodate different driver behaviors: waiting for a ride, driving to a passenger, or dropping off a passenger. For each, we have corresponding device-reported GPS data collected at frequent intervals. Using filtering, smoothing, and various machine learning algorithms in conjunction with OSM (OpenStreetMap), we demonstrate quantifiable improvements for a number of map-based applications, such as travel time estimation, route prediction and planning, map matching, and real-time vehicle location estimation.

We introduce a driver location estimation model that accurately simulates vehicle movement, a travel time prediction model that significantly improves the mean absolute error to actual travel times, and a map-matching framework that consistently infers true routes from noisy GPS trajectories. We also show how these accurate models improve our ability to make real-time decisions for driver dispatch, fare estimation, and path uncertainty. Our results demonstrate that it is possible to achieve robustness in large-scale mapping applications despite having only noisy location data. Furthermore, these results give promise for future improvements in our pipeline that will increase efficiency and reduce error in transportation networks and ride-sharing.
Samaneh Azadi, Jiashi Feng, Stefanie Jegelka and Trevor Darrell. Auxiliary Image Regularization For Deep CNNs With Noisy Labels
Abstract: Precisely labeled data sets with a sufficient number of samples are very important for training deep convolutional neural networks (CNNs). However, many of the available real-world data sets contain erroneously labeled samples, and those errors substantially hinder the learning of very accurate CNN models. In this work, we consider the problem of training a deep CNN model for image classification with mislabeled training samples – an issue that is common in real image data sets with tags supplied by amateur users. To solve this problem, we propose an auxiliary image regularization technique, optimized by the stochastic Alternating Direction Method of Multipliers (ADMM) algorithm, that automatically exploits the mutual context information among training images and encourages the model to select reliable images to robustify the learning process.

Intuitively, the proposed regularizer exploits the structure of the data and automatically retrieves useful auxiliary examples to collaboratively facilitate training of the classification model. Here, the structure of the data means the nonlinear manifold structure underlying images from multiple categories, learned from a well-trained deep model on another data set. To some extent, the AIR regularizer can be seen as seeking some “nearest neighbors” among the training examples to regularize the fitting of a deep CNN model to noisy samples and improve its classification performance in the presence of noise.

We use a group sparse norm to automatically select auxiliary images. In contrast to previous works imposing the regularization on the model parameters, we propose to construct groups of input image features and apply the group sparse regularizer to the response maps. Imposing such group sparsity regularization on the classifier response enables it to actively select the relevant and useful features, giving higher learning weights to the informative groups in the classification task and forcing the weights of irrelevant or noisy groups toward zero. The activated auxiliary images implicitly provide guiding information for training the deep models. We solve the associated optimization problem via ADMM.
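The group-sparsity step at the heart of such an ADMM iteration reduces to block soft-thresholding, the proximal operator of the group (l2,1) norm; a minimal numpy sketch (an illustration of the operator, not the authors' implementation):

```python
import numpy as np

def group_soft_threshold(v, groups, lam):
    """Proximal operator of the group-sparse (l2,1) norm: shrink each
    group's l2 norm by lam, zeroing groups whose norm falls below lam."""
    out = np.zeros_like(v)
    for idx in groups:
        g = v[idx]
        norm = np.linalg.norm(g)
        if norm > lam:
            out[idx] = (1 - lam / norm) * g  # shrink the whole group
    return out

v = np.array([3.0, 4.0, 0.1, -0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
z = group_soft_threshold(v, groups, lam=1.0)
# First group (norm 5) is shrunk; second (norm ~0.14) is zeroed out.
```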

We demonstrate the effect of AIR on image classification via deep CNNs, where we synthetically corrupt the training annotations. We investigate how the proposed method identifies informative images and filters out noisy ones among the candidate auxiliary images. Going one step further, we then explore how the proposed method improves learning of image classification from user-supplied tags and handles the inherent noise in these tags. Comprehensive experiments on benchmark datasets clearly demonstrate the effectiveness of our proposed method for the large scale image classification task.

Full text of this work, presented at ICLR 2016, is available at http://arxiv.org/abs/1511.07069.
Dianhuan Lin, Eyal Dechter, Kevin Ellis, Joshua Tenenbaum and Stephen Muggleton. Bias reformulation for one-shot function induction
Abstract: This work is about one-shot program induction, that is, given a single input-output example, automatically constructing a program that maps the given input to the target output. We chose a logic-based representation, not just for human readability, but also to facilitate debugging and program verification.

In [1], we focused on string transformation tasks such as mapping “Machine Learning” to “ML” or correcting "miKe dwIGHT" to "Mike Dwight". Such string transformation tasks have real-world applications in the use of spreadsheet technology. The existing implementation of program induction in Microsoft’s FlashFill (part of Excel 2013) already has strong performance on this problem, and performs one-shot learning, in which a simple transformation program is generated from a single example instance and applied to the remainder of the column in a spreadsheet. We demonstrated in [1] that learning performance can be improved over a series of tasks in the way humans improve. In particular, we show in [1] how a functional variant of the recently developed Metagol system can be applied to achieve such tasks. In experiments we study a regime of layered bias reformulation in which size bounds on hypotheses are successively relaxed in each layer and learned programs re-use sub-functions learned from previous layers. Results indicate that this approach leads to consistent speed increases in learning, more compact definitions, and consistently higher predictive accuracy over successive layers. Comparison to both FlashFill and human performance indicates that the new system, Metagol, approaches the skill level of both an existing commercial system and humans on one-shot learning over the same tasks. The induced programs are relatively easy for a human programmer to read and understand.

An extension of Metagol [2] has demonstrated the ability to machine-learn an efficient sorting algorithm equivalent to quicksort. Compared to neural-network-based program induction, such as the Neural Programmer-Interpreter [3], which also has small sample complexity and generates interpretable programs, Metagol does not require an execution trace of the program as part of the training examples. Another advantage of Metagol lies in higher generalization power: the Neural Programmer-Interpreter can handle sequences up to length 70, while Metagol can handle any length. Metagol also supports strategy learning, as its output is not a sequence of functions to be called, but instead a set of rules, which can capture the hierarchical structure in a strategy, such as the recursion in bubble sort and quicksort.

The code for Metagol and the experiment materials for [1] are publicly available in recomputable form at http://recomputation.org/ecai2014 and http://recomputation.org/node/40.

[1] D. Lin, E. Dechter, K. Ellis, J.B. Tenenbaum, and S.H. Muggleton. Bias reformulation for one-shot function induction. In Proceedings of the 23rd European Conference on Artificial Intelligence (ECAI 2014), pages 525-530, Amsterdam, 2014. IOS Press.
[2] A. Cropper and S.H. Muggleton. Learning efficient logical robot strategies involving composable objects. In Proceedings of the 24th International Joint Conference Artificial Intelligence (IJCAI 2015), pages 3423-3429. IJCAI, 2015.
[3] Scott Reed and Nando de Freitas. Neural Programmer-Interpreters. In International Conference on Learning Representations (ICLR), 2016.
Mariya Vasileva, David Forsyth and Svetlana Lazebnik. What Makes a Meadow “Lush”? MRF-Augmented Generative Networks for Synthesizing Natural Scene Images with Desired Transient Attributes