The Michigan Student Artificial Intelligence Lab (MSAIL) is a student organization for discussion of artificial intelligence and machine learning. Andrew Ng said:
“...if you read research papers consistently, if you seriously study half a dozen papers a week and you do that for two years, after those two years you will have learned a lot... But that sort of investment, if you spend a whole Saturday studying rather than watching TV, there's no one there to pat you on the back or tell you you did a good job.”
MSAIL is a community in which motivated students can read and discuss modern machine learning literature together. We welcome students of all backgrounds and abilities. To join MSAIL and stay up to date, simply join our Slack team! Also be sure to check out our sister organization: the Michigan Data Science Team! We are both graciously sponsored by the Michigan Institute for Data Science.
Gradient descent is a familiar algorithm, but it is perhaps more general than many realize. We explore theoretical bridges between gradient descent and other problem settings such as bandits and clustering, and describe Mirror Descent, a generalized algorithm that unifies these seemingly distinct settings. Our analysis of Mirror Descent will provide useful bounds across the presented variety of optimization problems.
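As a minimal sketch of the idea, assume the negative-entropy mirror map on the probability simplex (one common choice, not necessarily the one used in the talk). Mirror Descent then becomes the multiplicative-weights update familiar from expert and bandit settings. The function name `mirror_descent_simplex` and the toy cost vector are illustrative assumptions.

```python
import numpy as np

def mirror_descent_simplex(grad_fn, dim, step_size=0.1, iterations=500):
    """Entropic mirror descent (exponentiated gradient) over the probability simplex."""
    x = np.full(dim, 1.0 / dim)           # start from the uniform distribution
    for _ in range(iterations):
        g = grad_fn(x)
        x = x * np.exp(-step_size * g)    # gradient step taken in the dual (log) space
        x = x / x.sum()                   # KL projection back onto the simplex
    return x

# Toy problem (assumed for illustration): minimize the linear loss <costs, x>.
costs = np.array([0.9, 0.2, 0.5])
x_star = mirror_descent_simplex(lambda x: costs, dim=3)
# The weight concentrates on the coordinate with the lowest cost (index 1).
```

Replacing the entropy map with the squared Euclidean norm would turn the same loop into projected gradient descent, which is the sense in which Mirror Descent generalizes it.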
While deep learning is revolutionizing our view of what computers can do, another branch of “more-traditional” machine learning, tree-based methods, is also rapidly transforming industries and research. In fact, tree-based methods are used just as much as deep learning in cutting-edge technologies including self-driving and quant trading. Tree-based methods are faster to train, usually yield smaller models, and can handle sparse, incomplete, and irregular data. We’ll dive deep into gradient-boosted trees, in particular two implementations, XGBoost and LightGBM, and discuss the principles behind their use and their fast (and parallel) implementations.
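The core boosting loop can be sketched in a few lines. This is a simplified illustration with depth-one trees (stumps), squared loss, and one input feature; it is not how XGBoost or LightGBM are implemented, and all function names here are assumptions made for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) minimizing squared error on 1-D x."""
    best_error, best_stump = np.inf, None
    for threshold in np.unique(x)[:-1]:   # every split that leaves both sides non-empty
        left, right = residual[x <= threshold], residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if error < best_error:
            best_error, best_stump = error, (threshold, left.mean(), right.mean())
    return best_stump

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, rounds=50, learning_rate=0.1):
    base = y.mean()
    prediction = np.full(len(y), base)
    stumps = []
    for _ in range(rounds):
        residual = y - prediction         # negative gradient of 0.5 * squared error
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Toy data (assumed): fit a sine curve and compare against predicting the mean.
x = np.linspace(0.0, 6.0, 60)
y = np.sin(x)
model = gradient_boost(x, y)
mse_baseline = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - predict(model, x)) ** 2)
```

Each round fits a small tree to the current residuals and adds a damped copy of it to the ensemble, so the training error shrinks round by round.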
Come and enjoy a talk by Prof. Reetu Das on some of the recent advances in hardware for machine learning!
Algorithms already organize aspects of society, policy, and behavior, and will continue to become more entrenched in our physical worlds. We care about the technical side of AI because we are inherently interested in its social implications. This session will focus on one unexpected outcome of AI. Algorithms are typically considered to be more neutral and unbiased than humans and in some cases have displaced human responsibility for their outcomes. However, as Claire Miller stated, “software is not free of human influence. Algorithms are written and maintained by people, and machine learning algorithms adjust what they do based on people’s behavior. As a result...algorithms can reinforce human prejudices.” Hence, we will dive deeper into how one can reduce bias in algorithms.
Neuroscience and artificial intelligence (AI) are intrinsically related: although both are broad, they share the pursuit of long-standing questions. What mechanisms may give rise to intelligent agents? How can an understanding of these mechanisms be used to improve the world? Although the questions are similar, the approaches are not necessarily so. Many factors can make it unclear how relevant a greater understanding of the sister field is. To address this uncertainty, this talk and discussion seek to critically examine the similarity and dissimilarity of neuroscience and AI, using vision and convolutional neural networks (CNNs) as a case study.
Alan Turing’s 1950 paper on AI explores the idea of computational creativity and tries to answer the question of whether computers can create. He counters the common view that “machines cannot give rise to surprises” by tracing it to a “fallacy to which philosophers and mathematicians are particularly subject”, namely, the triviality of deduction. Indeed, it is because “there is [virtue in the working] out of consequences from data and general principles” that machines might surprise us. As predicted by Turing, machines are now learning to create. The degree of originality present in the machine output is a good topic for debate, as is the overall promise of the general approach of learning creativity. In this talk, we consider one specific creative system: Google’s Magenta.
Our human history has been an exercise in automation. Each pattern in desired behavior enables an effort-saving machine. For instance, while an ancient potter might have independently warped each individual grain of clay, modern potters specify much less data; leveraging the desired pattern of rotational symmetry, a pottery wheel turns a 1-D profile into a whole pot. Likewise, GIMP allows manipulation of images in terms of patterns such as digital paint brushes rather than individual pixels. Patterns have made data entry easy. But computer science has taken automation further. Whereas many disciplines design machines and programs, machine learning is about designing programs to design programs. If patterns enable machines, then machine learning seeks to automatically identify patterns and hence automatically build machines. Often, this is done with a hand-crafted objective function in mind --- one that rates how well prospective patterns fit data. But even the selection of objective functions is being automated. This is the idea of Generative Adversarial Networks. Our discussion will briefly review the original paper with example applications (Goodfellow), turn to a modern variation (Arjovsky), then linger on potential limits to the seeming magic of GANs (Arora).
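For reference, the minimax objective from the original GAN paper (Goodfellow) can be written as follows, where the generator $G$ maps noise $z$ to samples and the discriminator $D$ outputs the probability that its input is real. The discriminator here plays the role of a learned objective function for the generator.

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```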
How can children learn their native language given seemingly so little data? To what extent are cognitive mechanisms like language and mental physical simulations hard-coded in genetics and epigenetics, and to what extent are they learned? This week, we will survey the current viewpoints on human learning; as argued by Lake 2016, this psychological understanding can and should inform the design of artificial “minds”.
Come and enjoy a talk by Prof. Jason Mars!
Since the Delphic oracle, we’ve wondered: what happens next? In other words, we have hoped to extrapolate sequences: to learn functions from sequences to sequence elements. A generalization of this is time-series classification: learning functions from sequences to sets. For instance, one might want to classify each transcript in a grade dataset as belonging to a future scientist or not. Or, one might classify heartbeat patterns as normal or abnormal.
Several decades have passed since the first Neural Network was proposed. Since then, architectural ideas have proliferated: think of convolution, gating mechanisms, skip connections, and more. But while neural network architecture has traditionally been handcrafted, some recent efforts aim to partially automate this process (in effect, optimizing the network architecture for the dataset). Some early efforts in this direction show that machine-designed networks outperform hand-crafted networks on vision benchmarks such as CIFAR-10. We will discuss some of the recent advances!
Why should SGD even converge when you are training Neural Networks? The choice of optimization algorithm proves to be important, but we haven’t made many fundamental breakthroughs beyond SGD... until recently. To understand why SGD converges, we need to understand the framework of online learning and how this line of thought can be generalized to more complicated methods. Remember, ML = Probability (or Model) + Optimization. To understand ML, we have to understand when, how, and why SGD (and its cousins) works.
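As a small concrete reference point, here is a sketch of SGD on a convex least-squares problem, where convergence is well understood. Each update sees a single example, which is the link to online learning. The data are synthetic and noiseless, and the function name `sgd_least_squares` is an assumption made for this example.

```python
import numpy as np

def sgd_least_squares(features, targets, step_size=0.05, epochs=50, seed=0):
    """Plain SGD on 0.5 * (x . w - y)^2, visiting one example per update."""
    rng = np.random.default_rng(seed)
    weights = np.zeros(features.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(targets)):      # reshuffle every epoch
            error = features[i] @ weights - targets[i]
            weights -= step_size * error * features[i]   # gradient of the single-example loss
    return weights

# Synthetic, noiseless data (assumed for illustration).
rng = np.random.default_rng(1)
features = rng.standard_normal((200, 3))
true_weights = np.array([1.0, -2.0, 0.5])
targets = features @ true_weights
learned = sgd_least_squares(features, targets)
```

Neural network losses are not convex, so this sketch shows only the easy case that the theory starts from.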
Deep networks appear to go from questions to answers. But how do they do so? This week, we will ponder that question both from a forward view (which asks, given a question, what chain of reasoning might lead to an answer) and from a backward view (which asks, given a question and an answer, what chain of reasoning led between them). To the extent that deep neural networks think (which they do not yet do), the models we will discuss thus think about thinking! Dr. Marcus Rohrbach from Facebook AI Research will join us for this special meeting and discuss his research and its connection to reasoning.
Most machine learning algorithms seek to detect correlations between variables. However, correlation does not imply causation. Thus, to avoid concluding that ice cream causes summer or that chocolate consumption accelerates research, we must rethink our models. This leads us to the ideas of structural causal models and propensity scores. In fact, even when we do not seek causal conclusions, we may need to model them. Indeed, unless we acknowledge that variables under study might affect a datapoint’s visibility, we might think that the refrigerator light is always on! Thus, we must correct for selection bias. This week, we will give an overview of why and how we can incorporate the idea of causation into our algorithms!
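To make the role of propensity scores concrete, here is a sketch on simulated data with a single hidden confounder. All numbers are invented for illustration, and the propensities are assumed known rather than estimated. The naive comparison of treated and untreated groups overstates the effect, while inverse propensity weighting recovers something close to the true effect of 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

confounder = rng.random(n) < 0.5                 # e.g. "it is summer"
propensity = np.where(confounder, 0.8, 0.2)      # P(treated | confounder), assumed known
treated = rng.random(n) < propensity
# The true causal effect of treatment is 2.0; the confounder adds 3.0 on its own.
outcome = 2.0 * treated + 3.0 * confounder + rng.standard_normal(n)

# Naive estimate: difference in group means, biased by the confounder.
naive_effect = outcome[treated].mean() - outcome[~treated].mean()

# Inverse propensity weighting: reweight each unit by 1 / P(its observed treatment).
ipw_effect = (np.mean(treated * outcome / propensity)
              - np.mean(~treated * outcome / (1.0 - propensity)))
```

In practice the propensities are unknown and must themselves be estimated, which is where much of the difficulty lies.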
Capsule networks are a new development proposed by Geoffrey Hinton, aimed at correcting some of the major flaws of the CNNs in use today. The fundamental drawback of the CNN is that it does not consider the hierarchical relationship between the lower and higher level features that it detects. Though pooling is used to correct this flaw and help the CNN focus on certain areas, Hinton regards this as a "disaster". Instead, Hinton proposes (for the purpose of object detection and classification) to incorporate the relationships between objects to preserve the hierarchical pose of objects. This is done through the use of capsules. But does this actually work? If so, how do you train such a network? Join us for the answers.
Describing an image in natural language is hard, especially for computers. This task is arguably AI-complete, because describing images requires not only perception of objects but also perception of their interrelationships, an understanding of background knowledge of the world, and so on. Given a certain image, there is a plethora of questions one could ask that a computer would struggle to answer. This week we will focus on systems and approaches that bridge the gap between Vision and NLP to provide useful and descriptive captions for a variety of images.
Can computers produce something really new? Though one might perceive creativity in the strategies discovered by reinforcement learning Atari-players or in the behavior of image-valued generative adversarial networks, critics such as Douglas Hofstadter argue that the deep learning framework misses something essential. Modern models do rely on linearity for generalization (which can be thought of as finding word-to-word analogies in the language domain) rather than trying to find semantic relationships. Hence, Hofstadter proposes instead to isolate and study what he argues is cognition's core: abstract analogy-making. To this end, this week's papers explore systems for far-reaching analogy induction in symbolic domains.
Reinforcement learning is the study of learning intelligent behavior rather than only pattern recognition. RL has seen many successes with deep learning, such as AlphaGo. However, gradient-based optimization may struggle when the reward signal is sparse. Evolution Strategies do not suffer from sparse rewards and, relative to gradient-based RL, are easier to implement, depend on fewer hyperparameters, scale better in a distributed setting, and avoid other difficulties of gradient-based RL. ES could be an important part of building general artificial intelligence.
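The basic ES loop is short enough to sketch. This is a simplified variant with Gaussian perturbations and a mean-reward baseline, applied to a toy reward function rather than an RL environment; the function name `evolution_strategy` and all settings are assumptions made for illustration.

```python
import numpy as np

def evolution_strategy(reward_fn, initial_params, sigma=0.1, step_size=0.05,
                       population=50, iterations=300, seed=0):
    """Estimate a search direction from random perturbations; no backpropagation needed."""
    rng = np.random.default_rng(seed)
    params = np.array(initial_params, dtype=float)
    for _ in range(iterations):
        noise = rng.standard_normal((population, params.size))
        # Each perturbed parameter vector could be evaluated on a separate worker.
        rewards = np.array([reward_fn(params + sigma * eps) for eps in noise])
        centered = rewards - rewards.mean()          # baseline to reduce variance
        gradient_estimate = noise.T @ centered / (population * sigma)
        params = params + step_size * gradient_estimate
    return params

# Toy reward (assumed): negative squared distance to a hidden target vector.
target = np.array([0.5, 0.1, -0.3])
solution = evolution_strategy(lambda p: -np.sum((p - target) ** 2), np.zeros(3))
```

The reward function is only ever called, never differentiated, which is why the approach is simple to implement and to distribute.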
Let's enjoy an AI-themed movie before dispersing for Thanksgiving break! Shall we watch "2001: A Space Odyssey"? Or perhaps "WarGames"? Or "Chappie"? Or "The Matrix"? Vote on Slack's #random!
MSAIL is going to attend EECS' "Ada Lovelace Opera: A Celebration of Women in Computing"! It's free and includes lightning talks and a performance of Tchaikovsky's Enchantress. If you're interested, pre-register and join the #ada channel on Slack.
Convolutional Neural Networks (CNNs) have had great success in a number of challenging fields including speech recognition and computer vision. Despite this success, experts still do not know why they work so well. This means that improving a CNN is largely a matter of trial and error, which does not give researchers a lot to work with. The paper that we will discuss takes a step towards understanding how exactly CNNs work. Specifically, it looks into inverting CNNs, i.e. reconstructing the input to a CNN based on hidden activations. In order to show mathematically that CNNs are invertible, the paper uses ideas from Compressive Sensing, a subfield of signal processing that deals with acquiring and reconstructing signals using far fewer samples than traditional signal processing.
Subspace clustering assumes and exploits union-of-subspaces structure in data to discover rich underlying patterns. Like ordinary clustering, subspace clustering can be unsupervised or, aided by some hints, semi-supervised. Our main reading this week shows how subspace clustering can request and incorporate oracle-provided hints; in other words, the paper proposes an Active Learning method for subspace clustering. One of the authors of this paper will be the guest speaker this week: John Lipor.
Neuromorphic (brain-inspired) computing is an emerging technology that attempts to create chips that more closely mimic the brain. Memristors are a fairly recent type of hardware component (since 2008) that functions like a resistor, but also remembers the amount of charge that previously flowed through it. They retain this memory without power. Will using chips that more closely mimic the brain be the key to brain simulation and locally learnable models without the need for large amounts of pre-trained data? Come and find out!
Healthcare, healthcare, everywhere! MSAIL tends to focus on Machine Learning as an intellectual discipline whose main current applications are fun demonstrations. But ML is used in real life, too. One of its most inspiring and important applications lies in analyzing the deluge of data streaming out of medical journals, hospital records, and real-time patient measurements. However, given the huge variety of techniques applied to the field, how does one choose the optimal model while keeping credibility and actionability in mind?
How do we do Bayesian inference on models such as the Hierarchical Dirichlet Processes seen in past weeks? Markov Chain Monte Carlo is asymptotically exact but can require a prohibitively large number of steps. Variational Inference promises to solve a subset of these Bayesian inference problems more efficiently, albeit approximately. However, the common formulation of VI known as Mean Field Variational Bayes fails to accurately estimate its level of uncertainty and correlations between variables. The Giordano and Broderick paper proposes a fix inspired by Statistical Physics.
Part B will continue the basics developed in Part A by experimenting with depth, a key characteristic of Deep Learning. Depth just means using function composition to our advantage. Two great examples of how this affects architecture lie in Recurrent Networks and Feature Learning. Both of these ideas pop up in language models, so our data next week will be extracts from scripts of The Simpsons.
The promise and challenge of reinforcement learning (RL) lies in the goal of learning complex behaviors from sparse feedback. This paucity of feedback has led Yann LeCun to label RL the "cherry on top of the cake" of Machine Learning: an inconsequential decoration that does not address deeper issues of generalization. Yet, by combining neural networks originally conceived for function approximation (i.e. supervised learning) with RL updates, deep RL has achieved fabulous success.
Deep Q-networks (DQNs) learn policies from high-dimensional experience via end-to-end differentiable RL. DeepMind has shown that DQN agents, receiving only a stream of pixels and game scores as inputs, surpass the performance of all previous algorithms and often match human performance in a suite of 49 Atari 2600 games. DQN is thus the first artificial agent capable of learning to excel at tasks as diverse and challenging as Alien War and Igloo-Building.
Join us this Tuesday to discuss the techniques and significance of Deep RL. Is RL just a cherry on top of the cake, or is it a scaffold without which the cake has no intelligent form?
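For readers new to the topic, the Q-learning update that a DQN approximates with a neural network can be sketched in tabular form. This is not DQN itself: there are no pixels, no network, and no replay buffer. The five-state chain environment and the function name `q_learning_chain` are assumptions invented for this example.

```python
import numpy as np

def q_learning_chain(num_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     max_steps=100, seed=0):
    """Tabular Q-learning on a chain; the only reward is for reaching the last state."""
    rng = np.random.default_rng(seed)
    q = np.zeros((num_states, 2))                 # actions: 0 = left, 1 = right
    goal = num_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = int(rng.integers(2))         # uniformly random behavior policy
            next_state = max(state - 1, 0) if action == 0 else state + 1
            done = next_state == goal
            reward = 1.0 if done else 0.0
            # Bootstrapped target: reward plus discounted value of the best next action.
            target = reward if done else gamma * q[next_state].max()
            q[state, action] += alpha * (target - q[state, action])
            if done:
                break
            state = next_state
    return q

q_values = q_learning_chain()
greedy_policy = q_values[:-1].argmax(axis=1)      # best action in each non-goal state
```

Because Q-learning is off-policy, the agent can learn the "always move right" policy even though it acts at random while learning.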
This tutorial series aims to teach beginners the fundamentals of TensorFlow.
Neural networks are dense, parametric, and continuous, while language is sparse, non-parametric, and discrete. So how can the former process the latter?
Famously, one uses one-hot embeddings and softmax sampling to translate between continuous and discrete domains. One uses word embeddings to represent sparse sets of words as dense clouds of semantic vectors. One uses recurrent neural networks to reduce variable-length sequence problems to local, parametric ones.
But there has been another breakthrough recently: one can use Attention Mechanisms to model long-distance relationships between words! Attention lies at the core of this week's papers.
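As a rough sketch of what an attention mechanism computes, here is the scaled dot-product form in NumPy. This is one common variant and may differ from the formulations in this week's papers; the function names and the random toy inputs are assumptions made for illustration.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)   # for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Each query position receives a weighted average of all value vectors."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])       # pairwise similarity
    weights = softmax(scores, axis=-1)                        # one distribution per query
    return weights @ values, weights

# Toy input (assumed): a "sentence" of 5 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 4))
output, weights = attention(tokens, tokens, tokens)           # self-attention
```

Every token can attend to every other token in a single step, regardless of how far apart they are in the sequence.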
k-means is the root of all clustering algorithms. It generalizes in the following ways: (a) soft cluster assignments, (b) variable (i.e. inferred) numbers of clusters, and (c) clustering sets instead of points (e.g. clustering documents considered as bags of words). Hierarchical Dirichlet Models incorporate all three types of fanciness and provide a principled, interpretable way to cluster and analyze documents within a text corpus.
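For reference, the k-means baseline that these generalizations start from can be sketched as Lloyd's algorithm: hard assignments, a fixed number of clusters, and individual points. The function name `k_means` and the two-blob toy data are assumptions made for illustration.

```python
import numpy as np

def k_means(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate hard assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]  # copy of k random points
    for _ in range(iterations):
        # Distance from every point to every center, shape (num_points, k).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)          # hard assignment to the nearest center
        for j in range(k):
            if np.any(labels == j):                # leave empty clusters where they are
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Toy data (assumed): two well-separated blobs of 50 points each.
rng = np.random.default_rng(1)
points = np.concatenate([rng.normal(0.0, 0.1, size=(50, 2)),
                         rng.normal(5.0, 0.1, size=(50, 2))])
centers, labels = k_means(points, k=2)
```

Generalization (a) would replace the `argmin` with a probability over clusters, and (b) would let `k` be inferred rather than passed in.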
The past seven decades have seen wave after wave of Artificial Intelligence nostra (that is, pet schemes or favorite remedies). Just think of Perceptrons, Symbolic Methods, Expert Systems, Probabilistic Graphical Models, and Deep Learning. While none have proven able to model general intelligence, each brings its own advantages and limitations. The latest wave is Deep Learning. Join us and the ever-excellent Chengyu Dai as he speaks on his summer experience at Princeton studying modern Deep Learning from a theoretical point of view.
Why does science work? Specifically, if a model does well on the training set, will it do well on a test set of never-before-seen examples? In other words, which models generalize? Amazingly, we can derive non-trivial answers to this question. For example, it would be nice to say that "simple models generalize" --- this is Occam's razor, a pillar of both natural science and "data science". It turns out we can give "simple" and "generalize" precise yet motivated meanings such that Occam’s razor becomes provable. This week's papers define and relate a complexity measure (VC dimension) to a notion of generalization (PAC learnability).
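To make VC dimension concrete, here is a brute-force shattering check for one simple hypothesis class: closed intervals on the real line, which label a point positive if it falls inside the interval. The class and the function names are chosen for illustration and are not taken from this week's papers. Intervals can shatter two points but not three, so their VC dimension is 2.

```python
def interval_labelings(points):
    """All labelings of `points` achievable by classifiers of the form a <= x <= b."""
    achievable = {tuple(False for _ in points)}    # an interval that contains no point
    # It suffices to try intervals whose endpoints are themselves data points.
    for a in points:
        for b in points:
            if a <= b:
                achievable.add(tuple(a <= x <= b for x in points))
    return achievable

def is_shattered(points):
    """True if every one of the 2^n labelings is achievable (points assumed distinct)."""
    return len(interval_labelings(points)) == 2 ** len(points)

two_points_shattered = is_shattered([1.0, 2.0])
three_points_shattered = is_shattered([1.0, 2.0, 3.0])
```

With three points, the labeling that marks the outer two positive and the middle one negative is impossible for a single interval, which is what limits the class's complexity.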