Berlin Machine Learning Seminar

About

Seminars are generally of a theoretical nature and concern machine learning. It is assumed that seminar members are interested in, and not afraid of, the relevant mathematics.

The idea is to use the seminar as a vehicle for learning — particularly for the speaker. Speakers volunteer to give talks on subjects that they want to force themselves to learn about. As such, seminars can be fundamental or specialised, beginner or less so, depending on what the speaker wants to learn. We are all learners and the atmosphere of the seminar reflects this.

We also host talks from experts in their field.

The seminar is run by Benjamin Wilson, with the help of friends and colleagues. We meet roughly every second week in the offices of Lateral GmbH, in Kreuzberg, Berlin.

Please get in touch if you’d like to come along.

  1. Recurrent Neural Network Grammars
    Kate McCurdy
    Tuesday 16th January 2018
    Dyer, C., Kuncoro, A., Ballesteros, M., & Smith, N. A. (2016). Recurrent Neural Network Grammars. arXiv:1602.07776 [Cs].
    Published in NAACL 2016.

    We introduce recurrent neural network grammars, probabilistic models of sentences with explicit phrase structure. We explain efficient inference procedures that allow application to both parsing and language modeling. Experiments show that they provide better parsing in English than any single previously published supervised generative model and better language modeling than state-of-the-art sequential RNNs in English and Chinese.

  2. Learning embeddings in hyperbolic spaces
    Benjamin Wilson
    Thursday 30th November 2017
    Consider the problem of embedding a graph in two-dimensional Euclidean space. Nodes should be close together in the embedding space precisely when they are joined by an edge in the graph (where "close" means e.g. distance <= 1). How would you embed, say, a binary tree? One space-efficient way is to place the root at the origin, and arrange the nodes of depth N on a circle of radius N centred at the root. But things get crowded fast: there are 2^N nodes of depth N, and they have to fit around a circle whose circumference is only linear in N. In fact, already for small N, the distances between the neighbouring nodes on the Nth circle are <=1, and so are implicitly joined by an edge. But then you won't have a tree anymore! In order to proceed, you'll either need to work in a higher dimensional Euclidean space, or work in a space where circles have a longer circumference ...

    In hyperbolic space, the circumference of a circle grows exponentially with its radius, so it might be a great place for embedding trees, or indeed any data with a hierarchical structure. In this talk, we'll introduce a model or two for hyperbolic space, talk about how to train embeddings there, and compare this to the problem of embedding points on a sphere, where things are much easier to visualise.
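To make the circumference argument concrete, here's a minimal sketch (my own illustration, not material from the talk), using the standard fact that in the hyperbolic plane of curvature -1 a circle of radius r has circumference 2*pi*sinh(r):

```python
import math

# Circumference of a radius-r circle: Euclidean vs hyperbolic (curvature -1).
def circ_euclidean(r):
    return 2 * math.pi * r

def circ_hyperbolic(r):
    return 2 * math.pi * math.sinh(r)   # sinh(r) grows like e^r / 2

# A binary tree has 2**n nodes at depth n -- exponentially many, so they fit
# comfortably on a hyperbolic circle of radius n, but not on a Euclidean one.
for n in [1, 3, 5, 10]:
    print(n, 2 ** n, round(circ_euclidean(n), 1), round(circ_hyperbolic(n), 1))
```

Already at depth 10 the hyperbolic circle is long enough to hold the 1024 nodes at unit spacing, while the Euclidean circumference is only about 63.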


  3. Learning Semantic Hierarchies via Word Embeddings
    Janna Lipenkova
    Tuesday 7th November 2017
    In this talk, I will first outline the basic concepts of ontology structures, use cases and construction methods for ontologies. Then, I will go into detail on the paper "Learning Semantic Hierarchies via Word Embeddings" by Fu et al. (ACL 2014), which proposes a new method for constructing semantic hierarchies from word embeddings, using them to measure the semantic relationship between words. The method identifies whether a candidate word pair has a hypernym–hyponym relation by using word-embedding-based semantic projections between words and their hypernyms.

  4. Mapping paired representations to common spaces with CCA
    Stephen Enright-Ward
    Wednesday 25th October 2017
    Sometimes data scientists have different vectorial representations of the same thing, and wish to interrelate them. For example, a bilingual German/English newspaper might model each article with a pair of vectors: one for the German version, and one for the English. If the two text-to-vector models have been learned independently, the German and English vectors for each article will be dissimilar in general, but related because they represent the same story. Canonical Correlation Analysis, or CCA, is a mathematical technique that both quantifies such relationships, and provides an algorithm for mapping the two sets of vectors into a common space, such that the paired vectors become similar. I will explain what CCA is and how it works.

    Notes are available here.
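As a rough illustration of the idea (my own numpy sketch, not code from the talk): the canonical correlations are the singular values of the cross-covariance between the two whitened views. The "shared latent signal" data below is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 1))                   # shared latent signal (the "story")
X = np.hstack([z, rng.normal(size=(n, 2))])   # view 1: signal plus noise dimensions
Y = np.hstack([-z, rng.normal(size=(n, 2))])  # view 2: same signal, encoded differently

def cca_first_correlation(X, Y, reg=1e-8):
    n = X.shape[0]
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    # Whiten each view via a Cholesky factor, then take the SVD of the
    # whitened cross-covariance: its singular values are the canonical correlations.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    M = np.linalg.inv(Lx) @ Sxy @ np.linalg.inv(Ly).T
    return np.linalg.svd(M, compute_uv=False)[0]

r = cca_first_correlation(X, Y)
```

Because the first coordinate of each view carries the identical latent signal, the first canonical correlation comes out essentially equal to 1, even though the raw vectors are dissimilar.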

  5. Fooling neural networks
    Katharina Rasch
    Wednesday 13th September 2017
    It is surprisingly easy to fool deep neural networks (and other machine learning models). Change a few pixels in an image (changes not visible to humans!) and suddenly a zebra is classified with high confidence as a microwave. Let's look at some methods for generating such adversarial examples and discuss what this could tell us about what our models are actually learning.

    Here is a good read if you can't make it.
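As a toy illustration of the gradient-sign idea behind many such attacks (a sketch of my own on a hand-written linear classifier, far simpler than the deep-network attacks discussed in the talk; the weights and input are made up):

```python
import numpy as np

# A tiny linear "classifier": predict 1 if w . x > 0, else 0.
w = np.array([1.0, -2.0, 0.5])   # hypothetical fixed weights
x = np.array([0.3, -0.2, 0.4])   # input, classified as class 1

def predict(v):
    return int(w @ v > 0)

# FGSM-style perturbation: a small step per coordinate against the score.
eps = 0.5
x_adv = x - eps * np.sign(w)
```

Each coordinate moves by at most eps, yet the prediction flips: small, structured changes can be far more damaging than random noise of the same size.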

  6. Generative classification and spotting outliers
    Alan Nichol
    Thursday 24th August 2017
    Classifiers like SVMs and NNs don't give us a real, meaningful probability along with their prediction. They also lack the capacity to recognise 'outlier' data as not belonging to any class in particular. I posit that we need generative models to do this, and we'll try to build a Chinese Restaurant Process + Gaussian mixture model that can (i) model the full data distribution and hence give meaningful probability estimates and (ii) incorporate outliers into new classes.

  7. Using multi-armed bandits for dealing with the cold start problem
    Till Breuer
    Wednesday 9th August 2017
    Bandit algorithms can be used in recommendation systems to fill the knowledge gap at cold start in a principled way. In this talk, at least one simple bandit approach will be demonstrated using the example of a playful swipe app for exploring the car offerings of an Austrian car-reselling platform.

  8. Locality-Sensitive Hashing
    Aaron Levin
    Wednesday 26th July 2017
    In this talk we’ll explore locality-sensitive hashing, a technique to turn the computationally expensive exact nearest-neighbor search problem into an inexpensive approximate solution (it’s a neat trick and I promise you’ll love it). We’ll see how locality-sensitive hashing is used in image search, recommendations, and other machine learning problems. And of course, we’ll mention deep hashing, because why not?
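To give a flavour of the trick (a minimal sketch of my own, not code from the talk): for cosine similarity, one classic scheme hashes a vector to its pattern of signs against a set of random hyperplanes, so that nearby vectors agree on most bits. The vectors and dimensions below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 50, 32
planes = rng.normal(size=(n_bits, dim))   # one random hyperplane per hash bit

def lsh(v):
    # One sign bit per hyperplane: which side of the plane does v lie on?
    return tuple((planes @ v > 0).astype(int))

def hamming(a, b):
    return sum(p != q for p, q in zip(a, b))

x = rng.normal(size=dim)
near = x + 0.05 * rng.normal(size=dim)    # slightly perturbed copy of x
far = rng.normal(size=dim)                # unrelated vector
```

The probability that a single bit differs is proportional to the angle between the vectors, so `near` collides with `x` on almost every bit while `far` disagrees on roughly half — which is what lets buckets of equal hashes serve as cheap approximate nearest-neighbour candidates.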

  9. Hierarchical softmax
    Benjamin Wilson
    Wednesday 26th July 2017
    Hierarchical softmax is an alternative to the softmax that scales well in the number of possible classification outcomes. It has been applied in several word embedding models, where the task is predicting a missing vocabulary word, given the context. Indeed, together with negative sampling, it is what has made large-corpus word embeddings computationally viable. It works via a really neat trick, that begins by choosing an arbitrary binary tree whose leaves are the outcomes of the classification. I'll explain to you exactly how it works and explore some questions about the effect of the choice of tree on the classification quality.

    Notes from the talk are written up as a blog post.
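The core computation can be sketched in a few lines (my own toy example with a made-up depth-2 tree and random vectors, not code from the talk): the probability of a leaf is the product of sigmoid-gated binary decisions along its root-to-leaf path, and these probabilities sum to one automatically:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
dim = 5
h = rng.normal(size=dim)             # hidden/context vector
nodes = rng.normal(size=(3, dim))    # one learned vector per internal node

# Four leaves (= classification outcomes); each path records, at every
# internal node, whether we branch left (+1) or right (-1).
paths = {0: [(0, +1), (1, +1)], 1: [(0, +1), (1, -1)],
         2: [(0, -1), (2, +1)], 3: [(0, -1), (2, -1)]}

def p_leaf(leaf):
    p = 1.0
    for node, sign in paths[leaf]:
        p *= sigmoid(sign * nodes[node] @ h)   # binary decision at this node
    return p

total = sum(p_leaf(w) for w in range(4))
```

Since sigmoid(t) + sigmoid(-t) = 1 at every node, the leaf probabilities form a distribution by construction, and computing one of them costs only O(log n) dot products instead of the O(n) of a flat softmax.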

  10. Grammatical gender associations outweigh topical gender bias in cross-linguistic word embeddings
    Kate McCurdy
    Wednesday 5th July 2017
    Recent research has demonstrated that the relations captured in vector space semantic models can reflect undesirable biases in human culture. Our investigation of cross-linguistic word embeddings reveals that topical gender bias interacts with, and is surpassed in magnitude by, the effect of grammatical gender associations, and both may be attenuated by corpus lemmatization.

  11. Minsky and Papert's "Perceptrons"
    Benjamin Wilson
    Wednesday 7th June 2017
    Perceptrons are simple neural networks with a single layer of learned weights. Despite their simplicity, their invention (by Rosenblatt in 1958) brought about a first great wave of optimism for connectionist methods in artificial intelligence. This optimism was brought to a halt by the book of Minsky and Papert, published in 1969, which proved a number of interesting results about the sorts of functions that can and cannot be represented by a perceptron. In this talk, we'll review these results and the interesting methods (elementary group theory) used to prove them. We'll also take the opportunity to talk about the perceptron learning algorithm, which uses no knowledge of the error function beyond its evaluation at the training examples - in particular, it doesn't need gradients.

    Notes (pages 1-8)

  12. Learning end-to-end optimized lossy compression codecs
    Simon Wiedemann
    Wednesday 17th May 2017
    At the core of all multimedia technologies we find compression codecs. These are information processing methods that aim to eliminate as much redundancy as possible from the raw data of a source (thus reducing the amount of information needed to represent it). In lossy compression schemes, at some point during the process one quantizes the data values, which inevitably results in a loss of information. Thus, in the design of such codecs, one searches for the optimal trade-off between information loss (or distortion) and size reduction (or rate). However, due to the complexity of most real-world data sources (like images and videos), solving the posed rate-distortion problem exactly becomes intractable. Therefore, most of today's codecs (e.g. JPEG, H.264) rely on clever heuristics and methods that make the lossy compression problem tractable.
    Nonetheless, in recent years, neural networks have become commonplace for tasks that had for decades been accomplished by ad hoc algorithms and heuristics (e.g. object recognition). In fact, recent papers in this field have shown that this powerful class of methods is able to beat standard codecs (like JPEG) at lossy image compression. These results demonstrate their potential to perhaps some day replace the current paradigms for approaching the compression problem.

    In the first part of my talk, I will give a basic introduction to information and source coding theory and explain some of the standard heuristics employed in the field of lossy image compression. In my second part, I will explain the approach taken by Ballé et al. (under review for ICLR 2017), which is based solely on neural networks.

  13. A deeper introduction to reinforcement learning
    Ludwig Winkler
    Wednesday 3rd May 2017
    An introduction to reinforcement learning covering model-based, model-free and approximate methods with deep neural networks.

  14. Sampling-based approaches for language modelling and word embeddings
    Matthias Leimeister
    Wednesday 26th April 2017
    Neural probabilistic language models (NPLM) aim at predicting a word given its context. Using neural networks, those models learn a real-valued dense representation, or word embedding, for each word in the vocabulary that can be used for subsequent tasks. Training an NPLM based on a softmax classifier output for a target word given the context is computationally expensive for large vocabularies, so that various algorithms have been proposed for making the training more efficient. The talk will introduce several sampling-based approximations to the softmax that aim at distinguishing the true target word from a number of noise words, with a focus on noise contrastive estimation and negative sampling (word2vec).

    Andriy Mnih and Yee Whye Teh: A fast and simple algorithm for training neural probabilistic language models. ICML 2012. pdf

    Tomas Mikolov et al.: Distributed representations of words and phrases and their compositionality. NIPS 2013. pdf

  15. Everybody GANs Now
    Charley Wu
    Wednesday 15th March 2017
    Generative Adversarial Networks (GANs) are one of the biggest topics in Machine Learning right now. I will give a tutorial on the principles behind Adversarial Training, and how it involves finding a Nash Equilibrium between two competing agents, the Generator and the Discriminator. I will also show that this solution has interesting implications in terms of information-theoretic divergence measures (KL-Divergence and Jensen-Shannon Divergence), which underlie all traditional approaches to computational modeling (i.e., finding a maximum likelihood estimate). notes

  16. Are RNNs Turing complete?
    Benjamin Wilson
    Wednesday 1st February 2017
    At the recent NIPS conference, I must have heard it said five times that “RNNs are Turing complete”. All five times were by Juergen Schmidhuber. At least as many times, I heard other attendees grumbling that this was false. In this talk we’ll explore the assumptions and main ideas of the proof, and consider whether these assumptions hold for “real-world” RNNs implemented using finite-precision arithmetic.

    If you feel like a head-start, we’ll be covering the paper of Siegelmann and Sontag “On the Computational Power of Neural Nets” (1995). Notes from the talk are available here.

  17. NIPS report
    Marcel Ackermann
    Wednesday 11th January 2017
  18. COLING report
    Kate McCurdy
    Wednesday 11th January 2017
  19. A simple introduction to Bayesian Learning
    Stephen Enright-Ward
    Wednesday 30th November 2016
    Machine learning often means minimising a loss function by randomly initialising its parameters, then updating them repeatedly in response to batches of training data, for example using gradient descent. The idea behind Bayesian Learning is to replace a point estimate of the parameters with a probability distribution over the parameter space, and replace pointwise updates with Bayes' Theorem, to turn this "prior" distribution into a "posterior", which incorporates evidence from the training set. This talk will be an elementary introduction to Bayesian Learning, using simple examples and a small amount of theory. notes
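The prior-to-posterior update can be shown in its simplest conjugate form (a sketch of my own, not from the talk; the prior pseudo-counts and data are made up): a Beta prior on a coin's bias combined with Bernoulli observations gives a Beta posterior by simply adding counts.

```python
# Beta(a, b) prior on a coin's bias, updated with observed Bernoulli flips.
a, b = 2.0, 2.0                 # hypothetical prior pseudo-counts (heads, tails)
data = [1, 1, 0, 1, 1, 0, 1]    # observed flips
heads = sum(data)
tails = len(data) - heads

# Bayes' theorem for this conjugate pair reduces to count addition.
post_a, post_b = a + heads, b + tails
post_mean = post_a / (post_a + post_b)   # posterior expected bias
```

The posterior is a full distribution, not a point estimate: its mean (7/11 here) sits between the prior mean (1/2) and the raw frequency (5/7), and it tightens as more flips arrive.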

  20. Deep Neural Networks for YouTube Recommendations
    Alexey Rodriguez Yakushev
    Wednesday 16th November 2016
    A talk on the model and architecture of the YouTube recommendation system (abstract). It should offer some really interesting insights into the workings of one of the world's most heavily-used recommenders!

  21. Connectionist temporal classification
    Matthias Leimeister
    Wednesday 9th November 2016
  22. Predicting incidental comprehension of macaronic texts
    Kate McCurdy
    Wednesday 19th October 2016
  23. A word embedding model for explaining compositionality
    Stephen Enright-Ward
    Wednesday 17th August 2016
  24. Audio featurisation and speaker recognition
    Benjamin Wilson
    Wednesday 27th July 2016
  25. Collaborative Filtering and the NetFlix Prize
    David Yu
    Wednesday 6th July 2016
    We'll describe the Netflix Problem and then go through the paper "Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights” by Yehuda Koren and Robert Bell (Bell Labs) - the eventual winners of the competition. If time permits, we'll describe the MapReduce framework and how the paper's approach can be applied within it.

  26. Gradient Descent Methods
    Benjamin Wilson
    Wednesday 22nd June 2016
    We study the rate of convergence of gradient descent in the neighbourhood of a local minimum. The eigenvalues of the Hessian at the local minimum determine the maximum learning rate and the rate of convergence along the axes corresponding to the orthonormal eigenvectors. All this material is drawn from Chapter 7 of Bishop’s Neural Networks for Pattern Recognition, 1995. notes
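The eigenvalue analysis is easy to reproduce numerically (a sketch of my own, not from the talk or from Bishop): near a minimum the loss is quadratic, each eigendirection contracts by the factor (1 - eta * lambda_i) per step, and descent diverges once eta exceeds 2 / lambda_max. The eigenvalues below are made up:

```python
import numpy as np

# Gradient descent on f(x) = 0.5 * x^T H x with diagonal Hessian H.
eigs = np.array([1.0, 10.0])   # hypothetical Hessian eigenvalues; lambda_max = 10

def run(eta, steps=100):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - eta * eigs * x   # each axis shrinks by the factor (1 - eta * lambda_i)
    return x

stable = run(0.15)     # eta < 2 / lambda_max = 0.2: converges on every axis
unstable = run(0.25)   # eta > 2 / lambda_max: oscillates and diverges on the stiff axis
```

Note the asymmetry: the maximum usable learning rate is set by the largest eigenvalue, while the slowest convergence is along the direction of the smallest one — exactly the ill-conditioning story from Bishop's chapter.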

  27. Bayesian Dialog management: how to build a chat interface without a flow chart.
    Alan Nichol
    Thursday 9th June 2016
    Statistical dialogue systems are motivated by the need for a data-driven framework that reduces the cost of laboriously hand-crafting complex dialogue managers and that provides robustness against the errors created by speech recognisers operating in noisy environments. By including an explicit Bayesian model of uncertainty and by optimising the policy via a reward-driven process, partially observable Markov decision processes (POMDPs) provide such a framework. However, exact model representation and optimisation is computationally intractable. Hence, the practical application of POMDP-based systems requires efficient algorithms and carefully constructed approximations.

  28. Music similarity and speaker recognition: what do they have in common?
    Matthias Leimeister
    Wednesday 11th May 2016
    Music information retrieval has largely been influenced by methods from speech processing. In this talk I want to present a line of research on music similarity models that followed a parallel development in the area of speaker recognition and identification. Specifically I'll discuss a common set of features (MFCCs) and statistical models (single Gaussians, GMM supervectors, and i-vector factor analysis) that have been used successfully in both domains. slides

  29. Factorisation Machines
    Stephen Enright-Ward
    Wednesday 27th April 2016
    Notes available here.

  30. Negative sampling for recommender systems and word embeddings
    Benjamin Wilson
    Wednesday 13th April 2016
    Common to several approaches for training recommenders is contrasting the data observed in a context (e.g. the products purchased by a user) with negative samples chosen from the pool of all possible products. In this talk we'll review some different ways of choosing negative samples, starting with static approaches like uniform and popularity-based sampling and moving through to adaptive approaches (like BPR's adaptive oversampling and WARP) that take into account the context and the model's current belief. All of these ideas can be transferred to the world of word embeddings, where the negative sampling techniques are at present less sophisticated.


    1. Steffen Rendle and Christoph Freudenthaler, Improving Pairwise Learning for Item Recommendation from Implicit Feedback, 2014, pdf my notes

    2. Jason Weston, Samy Bengio and Nicolas Usunier, Wsabie: Scaling Up To Large Vocabulary Image Annotation, 2011 pdf my notes
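The two static schemes can be contrasted in a few lines (my own sketch with made-up popularity counts, not code from the talk; the 0.75 exponent is the smoothing used in word2vec-style unigram sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([1000.0, 100.0, 10.0, 1.0])   # hypothetical item popularities

# Static scheme 1: uniform over all items.
uniform = np.ones_like(counts) / len(counts)

# Static scheme 2: popularity-based, smoothed by raising counts to the 0.75.
popularity = counts ** 0.75
popularity /= popularity.sum()

# Draw a handful of negatives under the popularity-based distribution.
negatives = rng.choice(len(counts), size=5, p=popularity)
```

The smoothing exponent interpolates between the two extremes: 0 recovers uniform sampling, 1 recovers raw popularity. Adaptive schemes like BPR's oversampling and WARP go further by also conditioning on the context and the model's current scores.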

  31. Gaussian Process Models: what to do when you can’t optimize?
    Charley Wu
    Wednesday 16th March 2016
    Optimization problems are ubiquitous in machine learning, but what do we do when we can't optimize? When the problem space is too large, search costs are expensive, or it is not feasible to exhaustively search through all possible solutions, it becomes impossible to find a guaranteed optimal solution. Instead, we need to find the environment-appropriate balance between exploration and exploitation in order to produce a satisfactory solution within a finite search horizon.

    I will be giving a tutorial on Gaussian Process (GP) models as a method to model Bayesian beliefs about unexplored areas of the problem space. GPs give us both the expected value and the uncertainty of each location in the problem space, based on knowledge from previous observations. With a belief model of the environment, we can treat the optimization problem as a bandit problem, with an exploration goal (reduction of uncertainty) and an exploitation goal (closer approximation of the global optimum).

    Lastly, I will also introduce a bit of the literature on human exploration-exploitation behavior, and how untrained undergrads in a non-mathematical discipline are better and more efficient than the best state-of-the-art optimization algorithms. What makes us good at this task and how can we use this information to build better search algorithms?
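The "expected value plus uncertainty" picture can be reproduced with the textbook GP regression equations (my own numpy sketch with a made-up 1-d dataset and an RBF kernel, not material from the talk):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # Squared-exponential kernel between two 1-d point sets.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

X = np.array([-2.0, 0.0, 1.5])   # observed inputs (hypothetical)
y = np.sin(X)                    # observed values
Xs = np.array([0.0, 3.0])        # queries: one already seen, one far from the data
noise = 1e-6

K = rbf(X, X) + noise * np.eye(len(X))
Ks = rbf(Xs, X)

mean = Ks @ np.linalg.solve(K, y)                    # posterior mean
cov = rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)    # posterior covariance
```

At the observed point the posterior mean reproduces the data and the variance collapses; far from the data the variance stays near the prior — precisely the quantity a bandit-style acquisition rule trades off against the mean.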

  32. Active evaluation of Predictive Models
    Christoph Sawade
    Wednesday 2nd March 2016
    In order to make an informed decision about the deployment of a predictive model, it is crucial to know the model’s approximate performance. To evaluate performance, a set of labeled test instances is required that is drawn from the distribution the model will be exposed to at application time. In many practical scenarios, unlabeled test instances are readily available, but the process of labeling them can be a time- and cost-intensive task and may involve a human expert.

    This talk addresses the problem of evaluating a given predictive model accurately with minimal labeling effort. We study an active model evaluation process that selects certain instances of the data according to an instrumental sampling distribution and queries their labels. We derive sampling distributions that minimize estimation error with respect to different performance measures such as error rate, mean squared error, and F-measures.

    Another instance is the problem of evaluating the performance of a ranking function. In practice, ranking performance is estimated by applying a given ranking model to a representative set of test queries and manually assessing the relevance of all retrieved items for each query. We apply the concept of active evaluation to ranking functions and derive optimal sampling distributions for the commonly used performance measures Discounted Cumulative Gain (DCG) and Expected Reciprocal Rank (ERR).

  33. Hierarchical Self-Organising Maps and Random Swapping for 2-d Sorting
    Kai Uwe Barthel
    Wednesday 17th February 2016
    There are many techniques for dimensionality reduction used to visualize high-dimensional datasets. A typical approach is principal component analysis; however, a linear PCA projection cannot preserve local pairwise distances. Other approaches are multidimensional scaling, self-organizing maps (SOMs), local linear embedding and t-Distributed Stochastic Neighbor Embedding (t-SNE). In most cases the high-dimensional datasets are non-linearly projected and shown as points on a 2D map. If images are to be arranged in a 2D fashion, there are further constraints: the images should not overlap and the 2D display should be equally filled. I will talk about different approaches to visually sorting/arranging images using hierarchical SOMs and directed random swapping. One example of such a visually sorted image display is a visual image browsing system that lets users explore and search millions of images from stock photo agencies and the like. Similar to map services like Google Maps, users can navigate through multiple image layers by zooming and dragging.

  34. Counterfactual Performance Estimation or “how can your ML models have a Groundhog Day”
    Alexey Rodriguez Yakushev
    Wednesday 3rd February 2016
    I am currently very interested in evaluating new recommendation models without having to perform A/B testing every single time. So I have been reading a lot on how to properly use log data to evaluate model performance.

    It is very common to use log data to train and perform model selection/hyperparameter search of your models, think for instance of click through rate prediction in which you train models on your users’ click stream, or recommendation models which you train using the consumption histories of your users. Log data is relatively easy to get if you have a large user base but it has serious pitfalls.

    Log data usually has only partial feedback: you know that a user clicked or not on an ad, but you don’t know whether she would have clicked on a different ad. This bandit-like feedback will introduce bias in your evaluation metrics and will likely give you a distorted view of what will happen at A/B test time.

    We will use Counterfactual Estimators to ask “what if” questions, that is, what the performance would have been had you used a different model than the one used at logging time. Of course there are no miracles: we will look at the limitations of these estimators and their theoretical guarantees.

  35. An intuitive approach to clustering that works … sometimes.
    Vinzenz Schönfelder
    Wednesday 20th January 2016
    This talk presents a clustering method proposed in a recent Science paper. Generally favouring pragmatic approaches to problem solving (as a physicist…), I immediately fell in love with this approach when I learned about it from my colleagues at SISSA. Rodriguez and Laio devised a method in which the cluster centres are recognised as local density maxima that are far away from any points of higher density. The algorithm depends only on the relative densities rather than their absolute values. The authors tested the method on a series of data sets, and its performance compares favourably to that of established techniques. Looking at the method more closely, however, reveals essential weaknesses that depend on apparently minor details in the distribution of the data.

    Rodriguez, A., Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492–1496.
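The core of the method fits in a dozen lines (a 1-d toy sketch of my own, not the authors' code; the data, cutoff and Gaussian density smoothing are made up): cluster centres are points with high local density that are also far from any point of higher density.

```python
import numpy as np

pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 5.25])   # two separated 1-d blobs
d = np.abs(pts[:, None] - pts[None, :])                  # pairwise distances

rho = np.exp(-(d / 0.5) ** 2).sum(axis=1)   # smoothed local density per point

delta = np.empty(len(pts))
for i in range(len(pts)):
    higher = rho > rho[i]
    # Distance to the nearest point of strictly higher density;
    # the global density peak gets the maximum distance by convention.
    delta[i] = d[i, higher].min() if higher.any() else d[i].max()

# Centres score highly on BOTH density and distance-to-denser-point.
centers = np.argsort(rho * delta)[-2:]
```

Here the two selected centres land one per blob, as intended; the talk's point is that on less tidy data, the choice of density estimate and cutoff can matter a great deal.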

  36. Improving Distributional Similarity with Lessons Learned from Word Embeddings
    Marcel Ackermann
    Wednesday 6th January 2016
    A review of the main highlights of the paper of Levy et al. 2015.

  37. Kernels – Duality between Feature Representation and Similarity Measures
    Christoph Sawade
    Wednesday 2nd December 2015
    The questions of ‘how to represent data points’ and ‘how to measure similarity’ are crucial for the success of all methods in data science. Kernel functions can be seen as inner products in some Hilbert space and thus connect feature encoding with the concept of similarity. In this talk, I will elaborate this relationship in the regularized empirical risk framework, which captures lots of standard algorithms like Support Vector Machines, Ridge Regression and Logistic Regression. The goal is to develop a general understanding that is applicable throughout the machine learning world.
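The duality can be checked directly in a tiny example (my own sketch, not from the talk): a degree-2 polynomial kernel evaluates, without ever forming it, the inner product in the explicit feature space of all pairwise coordinate products.

```python
import numpy as np

def kernel(x, y):
    # Degree-2 polynomial kernel: k(x, y) = (x . y)^2
    return (x @ y) ** 2

def features(x):
    # The explicit feature map: all products x_i * x_j, flattened.
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, -1.0, 3.0])
```

The kernel costs O(d) to evaluate while the explicit feature space has d^2 dimensions — and for kernels like the RBF, the corresponding feature space is infinite-dimensional, so the kernel is the only practical way in.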

  38. MCMC, a space odyssey!
    Simon Pepin Lehalleur
    Wednesday 18th November 2015
    Computing integrals in high-dimensional spaces is a ubiquitous and challenging problem: traditional numerical quadrature methods fail miserably, because they sample the space uniformly while the integrand is usually supported on a very small volume. Markov Chain Monte Carlo is a family of algorithms which solve this problem by sampling more intelligently, using a Markov chain whose stationary distribution is the one we want to integrate. I will explain the basic ideas, including some of the many ways to set up a suitable Markov chain (Metropolis-Hastings, Gibbs sampling).

    Part 1 and Part 2 of M. Betancourt's two-part course at MLSS 2014.
    Survey of the aspects of Markov chains which bear on MCMC.
    Handbook of MCMC, esp. chap. 1 for a general overview and 5 for Hamiltonian MCMC.
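A random-walk Metropolis sampler fits in a dozen lines (a minimal sketch of my own targeting a standard normal, not from the talk or the references; proposal scale and chain length are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    return -0.5 * x ** 2   # unnormalised log-density of a standard normal

x = 0.0
samples = []
for _ in range(20000):
    proposal = x + rng.normal(scale=1.0)   # symmetric random-walk proposal
    # Metropolis accept/reject: only the RATIO of densities is needed,
    # so the normalising constant never has to be computed.
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal
    samples.append(x)

samples = np.array(samples[2000:])   # discard burn-in
```

The chain's stationary distribution is the target, so empirical moments of the (correlated) samples approximate the corresponding integrals — here the mean and standard deviation of the standard normal.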

  39. Limiting Distributions of Markov Chains
    Benjamin Wilson
    Thursday 5th November 2015
    A seminar on limiting distributions for (finite-state, time-homogeneous) Markov chains, drawing on PageRank as an example. We see in particular how the “random teleport” possibility in the PageRank random walk algorithm can be motivated by the theoretical guarantees that result from it: the transition matrix is both irreducible and aperiodic, and iterated application of the transition matrix converges more rapidly to the unique stationary distribution.

    Notes and further reading available here.
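The construction is short enough to spell out (my own sketch on a made-up four-page web, not material from the talk): mix the link-following transition matrix with a uniform teleport term, then iterate from any starting distribution.

```python
import numpy as np

# Adjacency matrix of a tiny hypothetical 4-page web.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)   # row-stochastic link-following matrix

n, damping = 4, 0.85
# Random teleport: with prob 0.15 jump to a uniformly random page.
# This makes the chain irreducible and aperiodic, guaranteeing a unique
# stationary distribution that power iteration converges to.
G = damping * P + (1 - damping) / n

pi = np.full(n, 1 / n)     # start from the uniform distribution
for _ in range(100):
    pi = pi @ G            # iterated application of the transition matrix
```

The teleport term also bounds the second-largest eigenvalue by the damping factor, which is what controls the convergence rate of the iteration.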

  40. Topological data analysis and the Mapper algorithm
    Stephen Enright-Ward
    Thursday 5th November 2015
    Giving a sensible visualisation of point clouds in high dimensions is a difficult problem in data science. Many traditional approaches fit point clouds to surfaces, trying to preserve distances. In the last ten years, people have started to use methods from topology, a branch of mathematics in which “distance” is replaced with the spongier notion of “in the neighbourhood of”, to model data points. The aim is to capture the global shape of the data — the signal — while ignoring local noise. I’ll explain one such approach, called the Mapper algorithm, developed by Singh, Mémoli and Carlsson in this paper.

  41. Parameterising the Mahalanobis distances
    Benjamin Wilson
    Wednesday 21st October 2015
    Given pairs of points that are known to be similar, and pairs that are known to be dissimilar, can we learn a metric such that similar points are close together? We'll consider the special case of the Mahalanobis distance, which is just the Euclidean distance after applying a linear transform. Posing this problem in different ways leads to very different optimisation problems. I'll share my first steps working on this. The introduction of this survey paper is a good entry point to the topic.

    Notes from the talk are here.
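The "Euclidean distance after a linear transform" characterisation is a one-line identity worth checking numerically (my own sketch with random data, not from the talk): if M = L^T L, then the Mahalanobis distance under M between x and y equals the ordinary Euclidean distance between Lx and Ly.

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(3, 3))   # any linear transform
M = L.T @ L                   # the induced positive semi-definite metric matrix

x, y = rng.normal(size=3), rng.normal(size=3)

# Mahalanobis distance under M ...
d_mahalanobis = np.sqrt((x - y) @ M @ (x - y))
# ... equals the Euclidean distance after applying L.
d_transformed = np.linalg.norm(L @ x - L @ y)
```

This is why the metric-learning problem can be parameterised either over PSD matrices M (a convex set) or over transforms L (unconstrained but non-convex): the two parameterisations describe the same family of distances.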

  42. Report on EMNLP 2015
    Kate M.
    Wednesday 7th October 2015
  43. Report on RecSys 2015
    Marcel Ackermann
    Wednesday 7th October 2015
  44. Limiting Distributions of Markov Chains
    Benjamin Wilson
    Tuesday 6th October 2015
    A seminar on limiting distributions for (finite-state, time-homogeneous) Markov chains, drawing on PageRank as an example. We see in particular how the “random teleport” possibility in the PageRank random walk algorithm can be motivated by the theoretical guarantees that result from it: the transition matrix is both irreducible and aperiodic, and iterated application of the transition matrix converges more rapidly to the unique stationary distribution.

    Notes and further reading available here.

  45. Pocket-sized Neural Networks
    Adriaan Schakel
    Tuesday 8th September 2015
  46. RNNs for Language Modelling
    Stephen Enright-Ward
    Tuesday 25th August 2015
  47. Nonnegative Matrix Factorization for Audio Source Separation
    Matthias Leimeister
    Tuesday 11th August 2015
    Audio source separation deals with the problem of decomposing a mixture of sound sources into their individual parts. Typical examples include noise suppression in speech signals or the extraction of single instruments from a music mix. A popular class of algorithms to approach the problem is nonnegative matrix factorization (NMF) and its extensions. In the talk I will review the basics of NMF for source separation following a recent survey paper [1] and present some concrete examples for speech and music [2].



    Slides from the talk are available here.
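The standard multiplicative updates can be sketched in a few lines (my own illustration with random nonnegative data standing in for a spectrogram, not code from the talk; this is the Lee & Seung update for the Frobenius objective):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((8, 6))   # nonnegative "spectrogram" (made-up data)
k = 3                    # number of components ("sources")
W = rng.random((8, k))   # spectral templates
H = rng.random((k, 6))   # activations over time

def frob(V, W, H):
    return np.linalg.norm(V - W @ H)

err0 = frob(V, W, H)
for _ in range(200):
    # Lee & Seung multiplicative updates: elementwise, so nonnegativity
    # of W and H is preserved automatically at every step.
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
err1 = frob(V, W, H)
```

In the source-separation setting, each column of W plays the role of a source's spectral template and each row of H its activation over time; masking V by one component's contribution then separates that source.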

  48. Interpreting word embedding arithmetic as set operations
    Alexey Rodriguez Yakushev
    Tuesday 28th July 2015
  49. Vector length as significance in word embeddings
    Benjamin Wilson
    Tuesday 28th July 2015
    An experimental approach to studying the properties of word embeddings is proposed. Controlled experiments, achieved through modifications of the training corpus, permit the demonstration of direct relations between word properties and word vector direction and length. Written up here.

  50. Attentional models for object recognition in Recurrent Neural Nets.
    Dave Kammeyer
    Tuesday 14th July 2015
    A discussion of an attention-based model for recognizing multiple objects in images, from this paper.

  51. Sparse L1-regularization for Feature Extraction
    Vinzenz Schönfelder
    Tuesday 16th June 2015
    I present the idea behind regularization in general and its sparse incarnation in particular. As motivation, I start with the project I worked on during my dissertation, a practical application of sparse regularization for feature selection in the auditory perceptual sciences. By discussing geometric interpretations of L1-regularisation as well as its relation to sparse Bayesian priors, I hope to provide an intuitive understanding of the underlying mechanism. Finally, I discuss standard methods for hyperparameter optimisation in regularised regression.

    Slides are available here.
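The feature-selection effect of the L1 penalty can be demonstrated directly (my own sketch, not from the talk; the data, penalty strength and proximal-gradient solver are all made-up choices): coefficients of irrelevant features are driven exactly to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]                  # only 2 of 10 features are relevant
y = X @ beta_true + 0.1 * rng.normal(size=n)

def soft_threshold(z, t):
    # The proximal operator of the L1 norm: shrink towards (and onto) zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Proximal gradient (ISTA) for the lasso: 0.5*||y - X b||^2 + lam*||b||_1
beta, lam = np.zeros(p), 20.0
step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant of the gradient
for _ in range(500):
    grad = X.T @ (X @ beta - y)
    beta = soft_threshold(beta - step * grad, step * lam)
```

The soft-thresholding step is exactly where sparsity comes from: an L2 penalty would shrink all coefficients proportionally but never set any to exactly zero.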

  52. Expectation Maximisation and Gaussian Mixture Models
    Benjamin Wilson
    Thursday 21st May 2015
    A talk on Expectation Maximisation where Gaussian Mixture Models are considered as an example application. The exposition follows Bishop section 2.6 and Andrew Ng’s CS229 lecture notes. If you weren’t at the seminar, then it is probably better to read one of these instead. Another useful reference is likely the 1977 paper by Dempster et al. that made the technique famous (this is something I would have liked to have read, but didn’t).

    Notes are here.
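A bare-bones version of the EM iteration for a two-component 1-d Gaussian mixture (my own sketch following the standard recipe in the references above, with made-up data): the E-step computes responsibilities, the M-step re-estimates the parameters, and the log-likelihood never decreases.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

w = np.array([0.5, 0.5])       # mixture weights
mu = np.array([-1.0, 1.0])     # initial means (deliberately poor)
var = np.array([1.0, 1.0])     # initial variances

lls = []
for _ in range(30):
    dens = np.stack([w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)])
    lls.append(np.log(dens.sum(0)).sum())   # log-likelihood under current params
    r = dens / dens.sum(0)                  # E-step: responsibilities
    nk = r.sum(1)                           # M-step: weighted re-estimation
    w = nk / len(x)
    mu = (r * x).sum(1) / nk
    var = (r * (x - mu[:, None]) ** 2).sum(1) / nk
```

The monotone increase of the log-likelihood is the key theoretical guarantee of EM, and on this well-separated data the means converge quickly to the true values of ±3.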

  53. Convolutional Neural Nets for NLP
    Adriaan Schakel
    Wednesday 22nd April 2015
  54. Non-negative Matrix Factorisation
    Benjamin Wilson
    Thursday 12th February 2015
    A talk on non-negative matrix factorisation (NMF), its probabilistic interpretation and the optimisation problems it poses. I find non-negativity constraints really interesting from the point of view of model interpretability, and NMF is a famous example. Most of us will have seen the example of facial image decomposition using NMF before. If you wanted to read something yourself, you could start with Lee and Seung's paper in Nature.

  55. Gaussian processes
    Margo K.
    Monday 15th December 2014
  56. Negative sampling and noise contrastive estimation in the context of probabilistic language modeling
    Adriaan Schakel
    Monday 8th December 2014
  57. Deep Belief Nets
    Stephen Enright-Ward
    Monday 24th November 2014
  58. Energy-based models, inference and Gibbs Sampling
    Benjamin Wilson
    Monday 27th October 2014
    Notes available here.

  59. Latent Dirichlet Allocation
    Marija V.
    Monday 20th October 2014
  60. Estimator theory and the bias/variance trade-off
    Benjamin Wilson
    Monday 11th August 2014
    A talk on the bias-variance trade-off, including the basics of estimator theory, some examples and maximum likelihood estimation.

    Notes are here. See also squared pair-wise differences estimate the variance.
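The bias of the maximum likelihood variance estimator is easy to see by simulation (my own sketch, not from the talk or the notes; the true variance, sample size and trial count are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 5, 20000   # true variance, sample size, Monte Carlo trials

biased, unbiased = [], []
for _ in range(trials):
    sample = rng.normal(0.0, np.sqrt(sigma2), size=n)
    biased.append(np.var(sample))            # divides by n: the ML estimator, biased
    unbiased.append(np.var(sample, ddof=1))  # divides by n-1: unbiased

# E[biased] = (n-1)/n * sigma^2 = 3.2, while E[unbiased] = sigma^2 = 4.0
```

With n = 5 the ML estimator underestimates the variance by a factor of 4/5 on average; the bias shrinks as 1/n, which is why it is harmless for large samples.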

  61. Dependently-typed programming languages
    Andreas G.
    Friday 8th August 2014
    The presentation will be an introduction to dependently typed programming languages. I hope to give a brief glimpse into how advanced type systems allow the programmer to be more explicit about the properties of their programs. In fact, the system is rigorous to the point where we can represent logical propositions by types, and then prove a proposition by writing a program of the corresponding type.

  62. Restricted Boltzmann Machines
    Stephen Enright-Ward
    Monday 21st July 2014
  63. Feasibility of Learning
    Marija V.
    Monday 30th June 2014
    The VC dimension and learning feasibility.

  64. Mathematical properties of Principal Component Analysis
    Benjamin Wilson
    Monday 19th May 2014
    Here are some notes from the talk.

  65. Transfer learning
    Heiko Schmidle
    Monday 28th April 2014
  66. Word2vec and the Distributional Hypothesis
    Benjamin Wilson
    Monday 14th April 2014
    An overview of word2vec, covering the CBOW learning task, hierarchical softmax, and the fundamental linguistic assumption: the distributional hypothesis.

    Here are better slides from a later talk on the same subject.