Seminars are generally of a theoretical nature and concern machine learning. It is assumed that seminar members are interested in, and not afraid of, the relevant mathematics.
The idea is to use the seminar as a vehicle for learning — particularly for the speaker. Speakers volunteer to give talks on subjects that they want to force themselves to learn about. As such, seminars can be fundamental or specialised, beginner or less so, depending on what the speaker wants to learn. We are all learners and the atmosphere of the seminar reflects this.
We also host talks from experts in their field.
The seminar is run by Benjamin Wilson, with the help of friends and colleagues. We meet roughly every second week in the offices of Lateral GmbH, in Kreuzberg, Berlin.
Please get in touch if you’d like to come along.

Unsupervised machine translation
Joris Dolderer
Wednesday 16th October 2019
more
less
Unsupervised machine translation systems require only monolingual text corpora  i.e. no dictionary or parallel sentences  for training. After rapid advances in the last few years, the current state of the art methods [1] achieve BLEU scores comparable to or better than the best supervised systems from 2014. Many of these systems incorporate oldfashioned statistical machine translation (SMT) systems, rather than relying on neural machine translation exclusively. Furthermore, they often start with crosslingual word or phrase embeddings learned in an unsupervised fashion, using algorithms such as VecMap [2].
The aim of this talk is to give an overview of how an unsupervised MT system works. For concreteness, I will focus on the system developed by Artetxe et al. [3], which incorporates VecMap.
Slides[1]
https://paperswithcode.com/task/unsupervisedmachinetranslation[2]
https://arxiv.org/abs/1805.06297[3]
https://arxiv.org/abs/1902.01313

Neural Networks Don't Learn Default Rules for German Plurals,
But That's Okay, Neither Do Germans
Kate McCurdy
Tuesday 3rd September 2019
more
less
Can artificial neural networks learn to represent inflectional morphology and generalize to new words as human speakers do? Some linguists have argued that the German number system cannot be modeled without rulebased symbolic computation, because the `default' plural marker, /s/, is also the least frequent; they claim that speaker preferences for /s/ in elsewhere conditions, such as novel and phonologically atypical nouns, require representation of linguistic rules and thus cannot be inferred from data alone.
We present a new dataset of German speakers' production and rating of plural forms for novel nouns, and note that the results provide at best weak support the claimed `default' status for /s/, reducing its potential challenge for neural models. Nonetheless, we observe that neural encoderdecoder models, while broadly successful on this `wug' task, show distinctive failure modes suggesting they do not generalize in quite the same manner as human speakers.

A StyleBased Generator Architecture for Generative Adversarial Networks
Katharina Rasch
Wednesday 5th June 2019
more
less
You might have seen
this face generator that creates creepily
realistic faces? Let's take a look
at the paper behind this method and learn how they pulled it off. This
paper is a nice combination of different techniques that have been
developed in GANs over the last years and a chance to learn a bit about
what has been going on in this field. Some things I will talk about are
progressive GANs, GAN evaluation metrics, dataset preparation, maybe a
bit on different loss functions that people use.
Original paper abstract:
arXivWe propose an alternative generator architecture for generative
adversarial networks, borrowing from style transfer literature. The
new architecture leads to an automatically learned, unsupervised
separation of highlevel attributes (e.g., pose and identity when
trained on human faces) and stochastic variation in the generated
images (e.g., freckles, hair), and it enables intuitive,
scalespecific control of the synthesis. The new generator improves
the stateoftheart in terms of traditional distribution quality
metrics, leads to demonstrably better interpolation properties, and
also better disentangles the latent factors of variation. To quantify
interpolation quality and disentanglement, we propose two new,
automated methods that are applicable to any generator architecture.
Finally, we introduce a new, highly varied and highquality dataset of
human faces.

GPyTorch: A scalable approach to Gaussian Processes
Ludwig Winkler
Wednesday 15th May 2019
more
less
Gaussian Processes are a class of popular and powerful probabilistic models which can be derived straight from a Normal distribution. While vanilla GP’s are very flexible due to their nonparametric formulation, inference and training both rely on the evaluation of the entire data set through several kernel matrices, their products, log determinants and inverses. These kernel matrices over the entire training data set and their numerous use in matrix matrix pdocuts, log determinants and inverses result in poor scalability to large data sets. The advent of GPU’s in machine learning has offered the possibility of computing matrix vector and matrix matrix products efficiently and in parallel. GPyTorch introduces several randomized and parallelized algorithms as replacements for exact computations which reduce the complexity of inference and training in GP’s. These include stochastic approximations of the trace, parallel conjugate gradient descent and several efficient applications of eigendecompositions. More importantly, all of these adaptations allow full utilization of parallel hardware during inference and training. Considerable focus will be on conjugated gradient descent which allows the efficient optimization of quadratic optimization problems through the use of line search and conjugate search directions.

Reinforcement Learning as Probabilistic Inference
Mathias Schmerling
Wednesday 27th March 2019
more
less
Among the many challenges facing (Deep) Reinforcement Learning algorithms, most notably when advancing into realworld robotics applications, is their lack of stability: stateoftheart algorithms are very sensitive to the choice of hyperparameters and asymptotic performance can vary widely across random seeds. A modification of the standard RL framework called Maximum Entropy Reinforcement Learning holds the promise to address this. In maximum entropy RL, the agent optimizes a tradeoff between the expected reward and entropy of its policy, i.e. tries to solve the task while behaving as randomly as possible.
While often framed as a purely algorithmic modification, surprisingly, maximum entropy RL can be identified with probabilistic inference in a particular graphical model. Drawing a link between Reinforcement Learning and Probabilistic Inference is interesting as graphical models allow for intuitive reasoning about partial observability, compositionality and (hierarchical) latent variables. It stands to reason that a probabilistic inference perspective can inspire new algorithms that are able to address the stability issues of Deep RL. This talk will explain the relation between RL and Probabilistic Inference and is based on "Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review" by Sergey Levine.

Inferring evolutionary histories with maximum likelihood in tree space
Benjamin Wilson
Wednesday 27th February 2019
more
less
This talk discuss the problem of "phylogenetic inference", i.e. the task of inferring evolutionary history from available (typically present day) genetic samples. No background in biology is assumed  we'll just be interested in the inference method.
We'll cover:
 the JukesCantor model of DNA evolution as a continuous time Markov chain;
 the application of maximum likelihood to determine the edge lengths in an evolutionary tree, given DNA sequence data;
 the combinatorial tree modifications that are used to move through the "space" of all possible trees.
One of the most interesting aspects (to me) is that the optimisation takes place in a space that is both continuous (you can change the edge lengths on a tree smoothly) and combinatorial (the collection of possible tree shapes, i.e. disregarding the edge lengths, is discrete). So what's the best way to do it?

Inference over graph structures using a diffusion kernel
Charley Wu
Wednesday 5th December 2018
more
less
Many types of problems require highly structured representations, which can be better described using graph structures rather than vector space representations. In this talk, I will describe a method for performing inference over graph structures using the Gaussian Process regression framework. I will introduce the Diffusion Kernel, which provides a method for estimating the covariance structure of any weighted or unweighted undirected graph, based on it's adjacency matrix. The Diffusion kernel can be understood as an extension of the Successor Representation for Bayesian inference, allowing us to compute predictive posterior estimates of the expected value and underlying uncertainty of unobserved states. I will provide a tutorial on how to use Diffusion kernel for performing inference, and draw connections to a variety of Machine Learning problems, including reuse of learned structures in Reinforcement Learning problems.

The Transformer
Adriaan Schakel
Thursday 15th November 2018
more
less
The attention mechanism introduced by Dzmitry Bahdanau et al. [1] in the context of neural machine translation has found application in a wide range of tasks in natural language processing [2,3] and beyond. Recently, Alec Radford et al. [4] proposed to use the mechanism to effectively train a language model on a corpus of unlabeled text. They demonstrated that their general, taskagnostic pretrained model outperforms discriminatively trained models with architectures specifically crafted for the task at hand.
This talk introduces the attention mechanism, provides an intuitive application in the context of computer vision, and details the specific iteration of the mechanism used in the Radford paper, viz. the transformer decoder [3].
 Neural machine translation by jointly learning to align and translate, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.
 Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.
 Generating Wikipedia by Summarizing Long Sequences, Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer
 Improving Language Understanding by Generative PreTraining, Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Their code can be found here.

On Laplacian Eigenmaps for Dimensionality Reduction
Juan Camilo Orduz
Thursday 4th October 2018
more
less
The aim of this talk is to describe a nonlinear dimensionality reduction algorithm based on spectral techniques introduced in Belkin & Niyogi (2003).
The goal of nonlinear dimensionality reduction algorithms, so called manifold learning algorithms, is to construct a representation of data on a low dimensional manifold embedded in a high dimensional space. The particular case we are going to present exploits various relations of geometric and spectral methods (discrete and continuous). Spectral methods are sometimes motivated by Marc Kac's question Can One Hear the Shape of a Drum? which makes reference to the idea of recovering geometrical properties from the eigenvalues (spectrum) of a matrix (linear operator). Concretely, the approach followed in Belkin & Niyogi (2003) has its foundation on the spectral analysis of the graph Laplacian of the adjacency graph constructed from the data (von Luxburg 2007). The motivation of the construction relies on the continuous limit analogue, the LaplaceBeltrami operator, in providing an optimal embedding for manifolds. We will also indicate the relation with the associated heat kernel operator (Ham. 2004). Instead of following a pure formal approach we will present the main geometric and computational ideas of the algorithm. Hence, with basic knowledge of linear algebra (eigenvalues) and differential calculus you will be able to follow the talk. Finally we will show a concrete example in Python using scikitlearn.

An Introduction to Homomorphic Encryption
Stephen EnrightWard
Wednesday 19th September 2018
more
less
Classically, one must decrypt encrypted data in order to process it usefully. This restriction forces us to trust the security of third party computational resources, e.g. in the cloud. In the last decade, Gentry and others have invented new methods enabling arbitrary computations on encrypted data, yielding encrypted outputs whose decryptions are the “correct” answers — i.e. the output of the same computation on the original plain text. Such schemes are called “homomorphic encryption”, since encryption and function evaluation commute in this way. I’ll review a simple homomorphic encryption scheme, and explain how and why it works.
No prior reading required, but if you'd like a headstart, here is an
introduction from Gentry Handwritten notes.

Surface Realization with Neural SequencetoSequence Inflection and Incremental LocalityBased Linearization
David King
Wednesday 18th July 2018
more
less
Surface realization is a nontrivial task as it involves taking structured data and producing grammatically and semantically correct utterances. Many competing grammarbased and statistical models for realization still struggle with relatively simple sentences. For our submission to the 2018 Surface Realization Shared Task, we tackle the shallow task by first generating inflected wordforms with a neural sequencetosequence model before incrementally linearizing them. For linearization, we use a global linear model trained using early update that makes use of features that take into account the dependency structure and dependency locality. Using this pipeline sufficed to produce surprisingly strong results in the shared task. In future work, we intend to pursue joint approaches to linearization and morphological inflection and incorporating a neural language model into the linearization choices.

Authorship modeling and prediction using latent topics (or LDA)
Martin Stamenov
Wednesday 18th July 2018
more
less
The advent of faster personal computers and tools such as automatic differentiation has enabled deep learning software like Theano. Software like this leads to wide adoption of deep learning, which in turn leads to an explosion in interest and new discoveries in the field. In a similar fashion, fast inference in probabilistic models with high quality and userfriendly implementations can lead to more interest and hence more research in Bayesian data analysis.
The main algorithm in this research is the fairly recent extension of LDA  Authortopic model. It promises to give data scientists a tool to simultaneously gain insight about authorship and content in terms of topics. The authors can represent many kinds of metadata attached to documents, for example, tags on posts on the web. The model can be used for data exploration, as features in machine learning pipelines, for author (or tag) prediction, or to simply leverage your topic model with existing metadata. Starting with the recent implementation of the Authortopic model in Gensim, we build on top of this work by creating a new feature, which allows inference of topics on a new collection of corpus data. We use it as querying tool for our training data in terms of finding similar authors or “tags”, by means of distance functions. We measure its authorpredictive performance on two separate datasets. The research ended with an accepted and merged pull request and and a Jupyternotebook[1] tutorial in Gensim’s Github repository, as this functionality seemed to be of interest for many [2] people.
[1] https://github.com/RaReTechnologies/gensim/pull/1766
[2] https://github.com/RaReTechnologies/gensim/blob/develop/docs/notebooks/atmodel_prediction_tutorial.ipynb

Machine learnathon
allin
Sunday 17th June 2018
more
less
It's like a hackathon, but for our machine learning ideas and projects. Let's choose something we want to learn about, break up into small groups (or go solo) and just do our best to learn about it for a day. It could be some theory, or a paper, or some code, whatever you like. We will teach and learn from each other, so please add to the spreadsheet below any topics you would like to learn about, and what help you could offer to others. It will be collaborative, not competitive, so there are no prizes. You can give a short presentation at the end of the day, if you like, or not, if you don’t. Afterwards, we’ll have dinner together. Hoping to see you there!

Neural Machine Translation  Better pay attention when you’re translating
Ludwig Winkler
Wednesday 23rd May 2018
more
less
Machine translation made a big step forward through the use of deep recurrent neural networks paired with an attention mechanism. Motivated by the recent advances I will explain how the sequencetosequence neural machine translation system used in Google Translate works, what role the attention mechanism plays in it and how zeroshot learning for translation is possible with multilingual translation systems.
Furthermore I will explain how the ‘Transformer’ architecture works which relies solely on attention and eliminates the sequential dependencies while processing information.
A great article that sparked my interest in NMT a while back can be found here:
https://www.nytimes.com/2016/12/14/magazine/thegreataiawakening.htmlThe papers which are the basis for this talk are:

Logistic regression on Riemannian manifolds
Matthias Leimeister
Wednesday 9th May 2018
more
less
In classification problems, item features are usually represented as a vector in Euclidean space. Common classifiers such as logistic regression or support vector machines can then be trained to distinguish two or more classes using these features. For some data sets, however, there is a more natural representation of the data on a lower dimensional manifold. For example, L2normalized feature vectors are naturally points on a unit sphere. Graph and word embeddings in hyperbolic space, which have gained much attention recently, provide another example [1].
In general, Riemannian manifolds have different geometric properties than flat Euclidean space. For example, the shortest path between two points, also called a geodesic, might not be a straight line when viewed from the ambient space. Also, in the Euclidean setting the scalar product is often used as a similarity function between two vectors. On a Riemannian manifold, a scalar product is only defined within the tangent space at each point, therefore requiring some other way to relate the points to each other. This makes it necessary to rethink the formulation of a linear classifier in terms of these geometric structures. Lebanon and Lafferty [2] propose to reformulate logistic regression on the unit sphere by interpreting the loss function geometrically and transforming it to the manifold setting.
In the talk, I will present some basic concepts from Riemannian geometry, introduce the classifier used by Lebanon and Lafferty and outline an extension for hyperbolic space.
[1] Maximilian Nickel, Douwe Kiela:
Poincaré Embeddings for Learning Hierarchical Representations. NIPS 2017.
[2] Guy Lebanon, John Laffertey:
Hyperplane Margin Classifiers on the Multinomial Manifold. ICML 2004.
Notebook Notes

Bayesian compression of Deep Neural Networks
Simon Wiedemann
Wednesday 25th April 2018
more
less
Every ML practitioner knows that neural networks are stateoftheart in a wide spectrum of tasks. However, their large amount of weights and computations difficult their application in real world scenarios, in particular in cases where they need to be deployed into resource constrained devices. Therefore, compression and efficiency has recently become an active topic of interest in the deep learning community.
Motivated from Bayes' theorem and the minimum description length principle, in this talk I will present variational dropout, a powerful technique that regularises the training of deep neural networks into being low complex. The series of papers that have developed this technique have shown that: 1) variational dropout can be viewed as generalisation of the famous dropout method and thus preforms as good or better than it, 2) the resulting trained networks can be highly compressed. In fact, authors report that they can prune up to 99.5% of the weights of LeNet5 away, or compress the VGG network by 95x of it’s original size.
Slides from the talk are available
here.
Papers:
arxiv.org/pdf/1506.02557arxiv.org/pdf/1701.05369arxiv.org/pdf/1705.08665

Hierarchical Reinforcement Learning with Options Framework
Oguz Serbetci
Wednesday 4th April 2018
more
less
Despite recent breakthrough applications of Reinforcement Learning, temporal abstraction remains a longstanding goal. Instead of making decisions at atomic timesteps, temporal abstractions allow a reinforcement learning agent to act in different timescales. For instance, while navigating a room an agent needs to make the highlevel decision regarding which door to use, but it also need to chose lowlevel actions: the steps required to reach the door. In this talk, we will look at an approach from RL literature called the Options Framework (1999 Sutton, Precup, Singh) with its recent extension the OptionCritic Architecture (2016 Bacon, Harb, Precup).
You do not have to possess a deep understanding of Reinforcement Learning as the basics will be explained when needed.
The Options Framework The OptionCritic Architecture

Breaking simple substitution ciphers with simple algorithms
Stephen EnrightWard
Thursday 22nd March 2018
more
less
Some historical texts are now unreadable, either because the script, or the underlying language, or both, are unknown. If record of the language still exists but the script is unfamiliar — for example, because the text is written in German, but in an unfamiliar alphabet — then reading the text is a twostep decipherment problem: First recognise the language, then map the unknown script to a familiar one (this applies both when the script was once widely used but is now forgotten, and when the script is a deliberate encryption scheme). I will discuss naive statistical algorithms to decipher such texts, applicable only for simple encryption techniques. Such techniques have been used successfully by Knight, Megyesi and Schaefer to
break the Copiale Cipher, in 2011, and have also been applied, so far unsuccessfully, to the
Voynich Manuscript.
This talk contains no original work and no machine learning.
Notes are available here.

Recurrent Neural Network Grammars
Kate McCurdy
Tuesday 16th January 2018
more
less
Dyer, C., Kuncoro, A., Ballesteros, M., & Smith, N. A. (2016). Recurrent Neural Network Grammars. arXiv:1602.07776 [Cs].
Published in NAACL 2016.
Retrieved from http://arxiv.org/abs/1602.07776
We introduce recurrent neural network grammars, probabilistic models of sentences with explicit phrase structure. We explain efficient inference procedures that allow application to both parsing and language modeling. Experiments show that they provide better parsing in English than any single previously published supervised generative model and better language modeling than stateoftheart sequential RNNs in English and Chinese.

Learning embeddings in hyperbolic spaces
Benjamin Wilson
Thursday 30th November 2017
more
less
Consider the problem of embedding a graph in twodimensional Euclidean space. Nodes should be close together in the embedding space precisely when they are joined by an edge in the graph (where "close" means e.g. distance <= 1). How would you embed, say, a binary tree? One spaceefficient way is to place the root at the origin, and arrange the nodes of depth N on a circle of radius N centred at the root. But things get crowded fast: there are 2^N nodes of depth N, and they have to fit around a circle whose circumference is only linear in N. In fact, already for small N, the distances between the neighbouring nodes on the Nth circle are <=1, and so are implicitly joined by an edge. But then you won't have a tree anymore! In order to proceed, you'll either need to work in a higher dimensional Euclidean space, or work in a space where circles have a longer circumference ...
In hyperbolic space, the circumference of circles grow exponentially with the radius, so it might be a great place for embedding trees, or indeed for embedding data that has a hierarchical structure. In this talk, we'll introduce a model or two for hyperbolic space and talk about how to train embeddings there, and compare this to the problem of embedding points on a sphere, where things are much easier to visualise.
Notes

Learning Semantic Hierarchies via Word Embeddings
Janna Lipenkova
Tuesday 7th November 2017
more
less
In this talk, I will first outline the basic concepts of ontology structures, use cases and construction methods for ontologies. Then, I will go into detail on the paper "Learning Semantic Hierarchies via Word Embeddings" by Fu et al. (ACL 2014) which proposes a new method for the construction of semantic hierarchies based on word embeddings, which can be used to measure the semantic relationship between words. The method identifies whether a candidate word pair has a hypernym–hyponym relation by using the wordembeddingbased semantic projections between words and their hypernyms.

Mapping paired representations to common spaces with CCA
Stephen EnrightWard
Wednesday 25th October 2017
more
less
Sometimes data scientists have different vectorial representations of the same thing, and wish to interrelate them. For example, a bilingual German/English newspaper might model each article with a pair of vectors: one for the German version, and one for the English. If the two texttovector models have been learned independently, the German and English vectors for each article will be dissimilar in general, but related because they represent the same story. Canonical Correlation Analysis, or CCA, is a mathematical technique that both quantifies such relationships, and provides an algorithm for mapping the two sets of vectors into a common space, such that the paired vectors become similar. I will explain what CCA is and how it works.
Notes are available
here.

Fooling neural networks
Katharina Rasch
Wednesday 13th September 2017
more
less
It is surprisingly easy to fool deep neural networks (and other machine learning models). Change a few pixels in an image (changes not visible to humans!) and suddenly a zebra is classified with high confidence as a microwave. Let's look at some methods for generating such adversarial examples and discuss what this could tell us about what our models are actually learning.
Here is a good read if you can't make it.

Generative classification and spotting outliers
Alan Nichol
Thursday 24th August 2017
more
less
Classifiers like SVMs and NNs don't give us a real, meaningful probability along with their prediction. They also lack the capacity to recognise 'outlier' data as not belonging to any class in particular. I posit that we need generative models to do this, and we'll try to build a Chinese Restaurant Process + Gaussian mixture model that can (i) model the full data distribution and hence give meaningful probability estimates and (ii) incorporate outliers into new classes.

Using multiarmed bandits for dealing with the cold start problem
Till Breuer
Wednesday 9th August 2017
more
less
Bandit algorithms can be used in the context of recommendation systems to reasonably fill the knowledge gap at cold start. Throughout this talk at least one simple bandit approach will be demonstrated at the example of a playful swipe app for exploring the car offerings of an Austrian car reselling platform.

LocalitySensitive Hashing
Aaron Levin
Wednesday 26th July 2017
more
less
In this talk we’ll explore localitysensitive hashing, a technique to turn the computationally expensive exact nearestneighbor search problem into an inexpensive approximate solution (it’s a neat trick and I promise you’ll love it). We’ll see how localitysensitive hashing is used in image search, recommendations, and other machine learning problems. And of course, we’ll mention deep hashing, because why not?

Hierarchical softmax
Benjamin Wilson
Wednesday 26th July 2017
more
less
Hierarchical softmax is an alternative to the softmax that scales well in the number of possible classification outcomes. It has been applied in several word embedding models, where the task is predicting a missing vocabulary word, given the context. Indeed, together with negative sampling, it is what has made largecorpus word embeddings computationally viable. It works via a really neat trick, that begins by choosing an arbitrary binary tree whose leaves are the outcomes of the classification. I'll explain to you exactly how it works and explore some questions about the effect of the choice of tree on the classification quality.
Notes from the talk are written up as a
blog post.

Grammatical gender associations outweigh topical gender bias in crosslinguistic word embeddings
Kate McCurdy
Wednesday 5th July 2017
more
less
Recent research has demonstrated that the relations captured in vector space semantic models can reflect undesirable biases in human culture. Our investigation of crosslinguistic word embeddings reveals that topical gender bias interacts with, and is surpassed in magnitude by, the effect of grammatical gender associations, and both may be attenuated by corpus lemmatization.

Minsky and Papert's "Perceptrons"
Benjamin Wilson
Wednesday 7th June 2017
more
less
Perceptrons are simple neural networks with a singlelayer of learned weights. Despite their simplicity, their invention (by Rosenblatt in 1958) brought about a first great wave of optimism for connectionist methods in artificial intelligence. This optimism was brought to a halt by the book of Minsky and Papert, published in 1969, which proved a number of interesting results about the sorts of functions that can and can not be represented by a perceptron. In this talk, we'll review these results and the interesting methods used to prove them (elementary group theory). We'll also take the opportunity to talk also about the perceptron learning algorithm, which uses no knowledge of the error function beyond its evaluation at the training examples  in particular, it doesn't need gradients.
Notes (pages 18)

Learning endtoend optimized lossy compression codecs
Simon Wiedemann
Wednesday 17th May 2017
more
less
At the core of all multimedia technologies we find compression codecs. This are information processing methods that aim to eliminate as much redundancies as possible contained in the raw data of a source (thus, reducing the amount of information needed in order to represent them). In lossy compression schemes, at some point during the process, one quantizes the data values which inevitably results in a loss of information. Thus, in the design of such codecs, one searches for the optimal tradeoff between information loss (or distortion) and size reduction (or rate). However, due to the complexity of most real world data sources (like images and videos), solving the posed ratedistortion problem becomes intractable. Therefore, most of todays codecs (e.g. JPEG, H264, etc.) rely on clever heuristics and methods that make the lossy compression problem amenable.
Nonetheless, in recent years, neural networks have become commonplace to perform tasks that had for decades been accomplished by ad hoc algorithms and heuristics (e.g. object recognition). In fact, recent papers in this field have shown that this powerful class of methods are able to beat standard codecs (like JPEG) for the task of lossy image compression. This results prove their potential to perhaps some day replace the current paradigms of approaching the compression problem.
In the first part of my talk, I will give a basic introduction to information and source coding theory and explain some of the standard heuristics employed in the field of lossy image compression. In my second part, I will explain the approach taken by Ballé et al. (under review for ICLR 2017), which is based solely on neural networks.

A deeper introduction to reinforcement learning
Ludwig Winkler
Wednesday 3rd May 2017
more
less
An introduction to reinforcement learning covering modelbased, modelfree and approximate methods with deep neural networks.

Samplingbased approaches for language modelling and word embeddings
Matthias Leimeister
Wednesday 26th April 2017
more
less
Neural probabilistic language models (NPLM) aim at predicting a word given its context. Using neural networks, those models learn a realvalued dense representation, or word embedding, for each word in the vocabulary that can be used for subsequent tasks. Training an NPLM based on a softmax classifier output for a target word given the context is computationally expensive for large vocabularies, so that various algorithms have been proposed for making the training more efficient. The talk will introduce several samplingbased approximations to the softmax that aim at distinguishing the true target word from a number of noise words, with a focus on noise contrastive estimation and negative sampling (word2vec).
Andriy Mnih and Yee Whye Teh:
A fast and simple algorithm for training neural probabilistic language models. ICML 2012.
pdfTomas Mikolov et. al.:
Distributed representations of words and phrases and their compositionality. NIPS 2013.
pdf

Everybody GANs Now
Charley Wu
Wednesday 15th March 2017
more
less
Generative Adversarial Networks (GANs) are one of the biggest topics in Machine Learning right now. I will give a tutorial on the principles behind Adversarial Training, and how it involves finding a Nash Equilibrium between two competing agents, the Generator and the Discriminator. I will also show that this solution has interesting implications in terms of informationaltheoretic divergence measures (KLDivergence and JensenShannon Divergence), which underly all traditional approaches to computational modeling (i.e., finding a maximum likelihood estimate).
notes

Are RNNs Turing complete?
Benjamin Wilson
Wednesday 1st February 2017
more
less
At the recent NIPS conference, I must have heard it said five times that “RNNs are Turing complete”. All fives time were by Juergen Schmidhuber. At least as many times, I heard other attendees grumbling that this was false. In this talk we’ll explore the assumptions and main ideas of the proof as well as consider whether these assumptions hold for “realworld” RNNs implemented using finiteprecision arithmetic.
If you feel like a headstart, we’ll be covering the paper of
Siegelmann and Sontag “On the Computational Power of Neural Nets” (1995). Notes from the talk are available
here.

NIPS report
Marcel Ackermann
Wednesday 11th January 2017

COLING report
Kate McCurdy
Wednesday 11th January 2017

A simple introduction to Bayesian Learning
Stephen EnrightWard
Wednesday 30th November 2016
more
less
Machine learning often means minimising a loss function by randomly initialising its parameters, then updating them repeatedly in response to batches of training data, for example using gradient descent. The idea behind Bayesian Learning is to replace a point estimate of the parameters with a probability distribution over the parameter space, and replace pointwise updates with Bayes' Theorem, to turn this "prior" distribution into a "posterior", which incorporates evidence from the training set. This talk will be an elementary introduction Bayesian Learning, using simple examples and small amount of theory.
notes

Deep Neural Networks for YouTube Recommendations
Alexey Rodriguez Yakushev
Wednesday 16th November 2016
more
less
A talk on the model and architecture of the YouTube recommendation system (
abstract). It should offer some really interesting insights into the workings of one of the world's most heavilyused recommenders!

Connectionist temporal classification
Matthias Leimeister
Wednesday 9th November 2016
more
less

Predicting incidental comprehension of macaronic texts
Kate McCurdy
Wednesday 19th October 2016

A word embedding model for explaining compositionality
Stephen EnrightWard
Wednesday 17th August 2016
more
less

Audio featurisation and speaker recognition
Benjamin Wilson
Wednesday 27th July 2016
more
less

Collaborative Filtering and the NetFlix Prize
David Yu
Wednesday 6th July 2016
more
less
We'll describe the Netflix Problem and then go through the paper "Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights” by Yehuda Koren and Robert Bell (Bell Labs)  the eventual winners of the competition. If time permits, description of the MapReduce framework and how to apply the approach in the paper to that framework.

Gradient Descent Methods
Benjamin Wilson
Wednesday 22nd June 2016
more
less
We study the rate of convergence of gradient descent in the neighbourhood of a local minimum. The eigenvalues of the Hessian at the local minimum determine the maximum learning rate and the rate of convergence along the axes corresponding to the orthonormal eigenvectors. All this material is drawn from Chapter 7 of Bishop’s Neural Networks for Pattern Recognition, 1995.
notes

Bayesian Dialog management: how to build a chat interface without a flow chart.
Alan Nichol
Thursday 9th June 2016
more
less
Statistical dialogue systems are motivated by the need for a datadriven framework that reduces the cost of laboriously handcrafting complex dialogue managers and that provides robustness against the errors created by speech recognisers operating in noisy environments. By including an explicit Bayesian model of uncertainty and by optimising the policy via a rewarddriven process, partially observable Markov decision processes (POMDPs) provide such a framework. However, ex act model representation and optimisation is computationally intractable. Hence, the practical application of POMDPbased systems requires efficient algorithms and carefully constructed approximations.

Music similarity and speaker recognition what do they have in common?
Matthias Leimeister
Wednesday 11th May 2016
more
less
Music information retrieval has largely been influenced by methods from speech processing. In this talk I want to present a line of research on music similarity models that followed a parallel development in the area of speaker recognition and identification. Specifically I'll discuss a common set of features (MFCCs) and statistical models (single Gaussians, GMM supervectors, and ivector factor analysis) that have been used successfully in both domains.
slides

Factorisation Machines
Stephen EnrightWard
Wednesday 27th April 2016
more
less

Negative sampling for recommender systems and word embeddings
Benjamin Wilson
Wednesday 13th April 2016
more
less
Common to several approaches for training recommenders is contrasting the data observed in a context (e.g. the products purchased by a user) with negative samples chosen from the pool of all possible products. In this talk well review some different ways of choosing negative samples, starting with static approaches like uniform and popularity based sampling and moving through to adaptive approaches (like BPRs adaptive oversampling and WARP) that take into account the context and the models current belief. All of these ideas can be transferred to the world of word embeddings, where the negative sampling techniques are at present less sophisticated.
References:
 Steffen Rendle and Christoph Freudenthaler, Improving Pairwise Learning for Item Recommendation from Implicit Feedback, 2014, pdf my notes
 Jason Weston, Samy Bengio and Nicolas Usunier, Wsabie: Scaling Up To Large Vocabulary Image Annotation, 2011 pdf my notes

Gaussian Process Models: what to do when you can’t optimize?
Charley Wu
Wednesday 16th March 2016
more
less
Optimization problems are ubiquitous in machine learning, but what do we do when we cant optimize? When the problem space is too large, search costs are expensive, or it is not feasible to exhaustively search through all possible solutions, it becomes impossible to find a guaranteed optimal solution. Instead, we need to find the environmentappropriate balance between exploration and exploitation in order to produce a satisfactory solution, within a finite search horizon.
I will be giving a tutorial on Gaussian Process (GP) models as a method to model Bayesian beliefs about unexplored areas of the problem space. GPs give us both the expected value and uncertainty of each location in the problem space, based on knowledge from previous observations. With a belief model of the environment, we can treat the optimization problem as a bandit problem, with an exploration goal (reduction of uncertainty) and an exploitation goal (closer approximation of the global optima).
Lastly, I will also introduce a bit of the literature on human explorationexploitation behavior, and how untrained undergrads in a nonmathematical discipline are better and more efficient than the best stateofthe art optimization algorithms. What makes us good at this task and how can we use this information to build better search algorithms?

Active evaluation of Predictive Models
Christoph Sawade
Wednesday 2nd March 2016
more
less
In order to make an informed decision about the deployment of a predictive model, it is crucial to know the model’s approximate performance. To evaluate performance, a set of labeled test instances is required that is drawn from the distribution the model will be exposed to at application time. In many practical scenarios, unlabeled test instances are readily available, but the process of labeling them can be a time and costintensive task and may involve a human expert.
This talk addresses the problem of evaluating a given predictive model accurately with minimal labeling effort. We study an active model evaluation process that selects certain instances of the data according to an instrumental sampling distribution and queries their labels. We derive sampling distributions that minimize estimation error with respect to different performance measures such as error rate, mean squared error, and Fmeasures.
Another instance is the problem of evaluation the performance of a ranking function. In practice, ranking performance is estimated by applying a given ranking model to a representative set of test queries and manually assessing the relevance of all retrieved items for each query. We apply the concept of active evaluation to ranking functions and derive optimal sampling distributions for the commonly used performance measures Discounted Cumulative Gain (DCG) and Expected Reciprocal Rank (ERR).

Hierarchical SelfOrganising Maps and Random Swapping for 2d Sorting
Kai Uwe Barthel
Wednesday 17th February 2016
more
less
There are many techniques for dimensionality reduction used to visualize highdimensional datasets. A typical approach is principal component analysis, however a linear PCA projection cannot preserve local pairwise distances. Other approaches are multidimensional scaling, selforganizing maps (SOMs), local linear embedding or tDistributed Stochastic Neighbor Embedding (tSNE). In most cases the highdimensional datasets are nonlinearly projected and shown as points on a 2D map. If images are to be arranged in a 2D fashion, there are further constraints. The images should not overlap and the 2D display should be equally filled. I will talk about different approaches how to visually sort/arrange images using hierarchical SOMs and directed random swapping approaches. One example of such a visually sorted image display is picsbuffet.com which is a visual image browsing system to visually explore and search millions of images from stock photo agencies and the like. Similar to map services like Google Maps users may navigate through multiple image layers by zooming and dragging.

Counterfactual Performance Estimation or “how can your ML models have a Groundhog Day”
Alexey Rodriguez Yakushev
Wednesday 3rd February 2016
more
less
I am currently very interested in evaluating new recommendation models without having to perform A/B testing every single time. So I have been reading a lot on how to properly use log data to evaluate model performance.
It is very common to use log data to train and perform model selection/hyperparameter search of your models, think for instance of click through rate prediction in which you train models on your users’ click stream, or recommendation models which you train using the consumption histories of your users. Log data is relatively easy to get if you have a large user base but it has serious pitfalls.
Log data usually has only partial feedback, you know that a user clicked or not on an ad, but you don’t know whether she would have clicked or not on a different ad. This bandit like feedback will introduce bias in your evaluation metrics and will likely give you a distorted view of what will happen at A/B test time.
We will use Counterfactual Estimators to ask “what if” questions, that is, what would have been the performance if you use a different model than used at logging time. Of course there are no miracles, we will look at limitations of these estimators and their theoretical guarantees.

An intuitive approach to clustering that works … sometimes.
Vinzenz Schönfelder
Wednesday 20th January 2016
more
less
This talk presents a clustering method proposed in a recent Science paper. Generally favouring pragmatic approaches to problem solving (as a physicist…), I immediately fell in love with this approach when I learned about it from my colleagues at SISSA. Rodriguez and Laio devised a method in which the cluster centres are recognised as local density maxima that are far away from any points of higher density. The algorithm depends only on the relative densities rather than their absolute values. The authors tested the method on a series of data sets, and its performance compares favourably to that of established techniques. Looking at the method more closely, however, reveals essential weaknesses that depend on apparently minor details in the distribution of the data.
Rodriguez, A., Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492–1496.

Improving Distributional Similarity with Lessons Learned from Word Embeddings
Marcel Ackermann
Wednesday 6th January 2016
more
less

Kernels – Duality between Feature Representation and Similarity Measures
Christoph Sawade
Wednesday 2nd December 2015
more
less
The questions of ‘how to represent data points’ and ‘how to measure similarity’ are crucial for the success of all methods in data science. Kernel functions can be seen as inner products in some Hilbert space and thus connect feature encoding with the concept of similarity. In this talk, I will elaborate this relationship in the regularized empirical risk framework, which captures lots of standard algorithms like Support Vector Machines, Ridge Regression and Logistic Regression. The goal is to develop a general understanding that is applicable throughout the machine learning world.

MCMC, a space odyssey!
Simon Pepin Lehalleur
Wednesday 18th November 2015
more
less
Computing integrals in highdimensional spaces is a ubiquitous and challenging problem: traditional numerical quadrature methods fail miserably, because they sample the space uniformly while the integrand is usually supported on a very small volume. Markov Chain Monte Carlo is a family of algorithms which solve this problem by sampling more intelligently, using a Markov chain whose stationary distribution is the one we want to integrate. I will explain the basic ideas, including some of the many ways to set up a suitable Markov chain (MetropolisHastings, Gibbs sampling)
References:
Part 1 and
Part 2 of M. Bethancourt 2part course in MLSS 2014.
Survey of the aspects of Markov chains which bear on MCMC.
Handbook of MCMC, esp. chap. 1 for a general overview and 5 for Hamiltonian MCMC.

Limiting Distributions of Markov Chains
Benjamin Wilson
Thursday 5th November 2015
more
less
A seminar on limiting distributions for (finitestate, timehomogeneous) Markov chains, drawing on PageRank as an example. We see in particular how the “random teleport” possibility in the PageRank random walk algorithm can be motivated by the theoretical guarantees that result from it: the transition matrix is both irreducible and aperiodic, and iterated application of the transmission matrix converges more rapidly to the unique stationary distribution.
Notes and further reading available
here.

Topological data analysis and the Mapper algorithm
Stephen EnrightWard
Thursday 5th November 2015
more
less
Giving a sensible visualisation of point clouds in high dimensions is a difficult problem in data science. Many traditional approaches fit point clouds to surfaces, trying to preserve distances. In the last ten years, people have started to use methods from topology, a branch of mathematics in which “distance” is replaced with the spongier notion of “in the neighbourhood of”, to model data points. The aim is to capture the global shape of the data — the signal — while ignoring local noise. I’ll explain one such approach, called the Mapper algorithm, developed by Singh, Mémoli and Carlsson in
this paper

Parameterising the Mahalanobis distances
Benjamin Wilson
Wednesday 21st October 2015
more
less
Given pairs of points that are known to be similar, and pairs that are known to the dissimilar, can we learn a metric such that similar points are close together? Well consider the special case of the Mahalanobis distance, which is just the Euclidean distance after applying a linear transform. Posing this problem in different ways leads to very different optimisation problems. Ill share my first steps working on this. The introduction of
this survey paper gives a good introduction to the topic.
Notes from the talk are
here.

Report on EMNLP 2015
Kate M.
Wednesday 7th October 2015

Report on RecSys 2015
Marcel Ackermann
Wednesday 7th October 2015

Limiting Distributions of Markov Chains
Benjamin Wilson
Tuesday 6th October 2015
more
less
A seminar on limiting distributions for (finitestate, timehomogeneous) Markov chains, drawing on PageRank as an example. We see in particular how the “random teleport” possibility in the PageRank random walk algorithm can be motivated by the theoretical guarantees that result from it: the transition matrix is both irreducible and aperiodic, and iterated application of the transmission matrix converges more rapidly to the unique stationary distribution.
Notes and further reading available
here.

Pocketsized Neural Networks
Adriaan Schakel
Tuesday 8th September 2015

RNNs for Language Modelling
Stephen EnrightWard
Tuesday 25th August 2015
more
less

Nonnegative Matrix Factorization for Audio Source Separation
Matthias Leimeister
Tuesday 11th August 2015
more
less
Audio source separation deals with the problem of decomposing a mixture of sound sources into their individual parts. Typical examples include noise suppression in speech signals or the extraction of single instruments from a music mix. A popular class of algorithms to approach the problem is nonnegative matrix factorization (NMF) and its extensions. In the talk I will review the basics of NMF for source separation following a recent survey paper [1] and present some concrete examples for speech and music [2].
[1]
http://cal.cs.illinois.edu/papers/virtanenspm2015.pdf[2]
http://www.cs.tut.fi/~tuomasv/demopage.htmlSlides from the talk are available
here.

Interpreting word embedding arithmetic as set operations
Alexey Rodriguez Yakushev
Tuesday 28th July 2015

Vector length as significance in word embeddings
Benjamin Wilson
Tuesday 28th July 2015
more
less
An experimental approach to studying the properties of word embeddings is proposed. Controlled experiments, achieved through modifications of the training corpus, permit the demonstration of direct relations between word properties and word vector direction and length. Written up
here.

Attentional models for object recognition in Recurrent Neural Nets.
Dave Kammeyer
Tuesday 14th July 2015
more
less
A discussion of an attentionbased model for recognizing multiple objects in images, from
this paper.

Sparse L1regularization for Feature Extraction
Vinzenz Schönfelder
Tuesday 16th June 2015
more
less
I present the general idea behind regularization in general and its sparse incarnation in particular. As a motivation, I start with the project I worked on during my dissertation. It represents a practical application of sparse regularization for feature selection in the auditory perceptual sciences. By discussing geometric interpretations of L1regularisation as well as its relation to sparse Bayespriors, I hope to provide an intuitive understanding of the underlying mechanism. Finally, I discuss standard methods for hyperparameter optimisation in regularised regression.
Slides are available
here.

Expectatation Maximisation and Gaussian Mixture Models
Benjamin Wilson
Thursday 21st May 2015
more
less
A talk on Expectation Maximisation where Gaussian Mixture Models are considered as an example application. The exposition follows Bishop section 2.6 and Andrew Ng’s CS229 lecture notes. If you weren’t at the seminar, then it is probably better to read one of these instead. Another useful reference is likely the 1977 paper by Dempster et al. that made the technique famous (this is something I would have liked to have read, but didn’t).
Notes are
here.

Convolutional Neural Nets NLP
Adriaan Schakel
Wednesday 22nd April 2015

Nonnegative Matrix Factorisation
Benjamin Wilson
Thursday 12th February 2015
more
less
A talk on nonnegative matrix factorization (NMF), its probabilistic interpretation and the optimization problems it poses. I find nonnegativity constraints really interesting from the point of view of model interpretability, and NMF is a famous example. Most of us will have see the example of facial image decomposition using NMF before. If you wanted to read something yourself, you could start with
Lee and Seungs paper in Nature.

Gaussian processes
Margo K.
Monday 15th December 2014

Negative sampling and noise contrastive estimation in the context of probabilistic language modeling
Adriaan Schakel
Monday 8th December 2014

Deep Belief Nets
Stephen EnrightWard
Monday 24th November 2014

Energybased models, inference and Gibbs Sampling
Benjamin Wilson
Monday 27th October 2014
more
less

Latent Dirichlet Allocation
Marija V.
Monday 20th October 2014

Estimator theory and the bias/variance tradeoff
Benjamin Wilson
Monday 11th August 2014
more
less

Dependentlytyped programming languages
Andreas G.
Friday 8th August 2014
more
less
The presentation will be an introduction to dependently typed programming languages. I hope to give a brief glimpse into how advanced type systems allows the programmer to be more explicit in the way he expresses properties of his programs. In fact, the system is rigorous to the point where we can represent logical propositions by types, and further prove our proposition by writing a program of the corresponding type.

Restricted Boltzmann Machines
Stephen EnrightWard
Monday 21st July 2014

Feasibility of Learning
Marija V.
Monday 30th June 2014
more
less
The VC dimension and learning feasibility.

Mathematical properties of Principal Component Analysis
Benjamin Wilson
Monday 19th May 2014
more
less
Here are some
notes from the talk.

Transfer learning
Heiko Schmidle
Monday 28th April 2014

Word2vec and the Distributional Hypothesis
Benjamin Wilson
Monday 14th April 2014
more
less
An overview of word2vec, covering the CBOW learning task, hierarchical softmax, and the fundamental linguistic assumption: the distributional hypothesis.
Here are better slides from a later talk on the same subject.