Keynote Talk: Model Based Machine Learning

>>It’s my absolute pleasure
to introduce Chris Bishop. Chris Bishop is
a Microsoft Technical Fellow. He also is the
Managing Director of our Cambridge Research Lab. He’s also Professor
of Computer Science at the University of Edinburgh and a Fellow of Darwin College, Cambridge. In 2014, he was elected Fellow of the Royal
Academy of Engineering. And in 2007, he was elected Fellow of the Royal
Society of Edinburgh. And in 2017, he was elected as a Fellow
of the Royal Society. It is a long list of achievements and accolades that I could go on talking about, but since we’re running rather late, Chris’ talk
will do the talking. I’m sure he won’t disappoint. So, without much further ado,
Chris, all yours.>>Thank you very much. Thanks for the
invitation to come here. It’s a great privilege
to be the final speaker. I thought what I’d do for this talk, rather than talk about any particular application or particular algorithms, is to step right up to 50,000 feet and think about machine learning: what is it all about? What are we trying to achieve? And in particular, to
give you a perspective on machine learning that I call Model-Based Machine Learning, which you can think
of as a compass to guide you through
this very complex world. So, Machine Learning can
be very intimidating. There are many, many algorithms. Here are a few. Every year, hundreds more are published. You’ve heard about lots today. And especially, if you’re
a newcomer to the field, it’s bewildering,
it’s intimidating. Which ones do you need to learn about? Which ones should you
use for your application? It can be really challenging. So, Model-Based
Machine Learning is just a perspective
that I hope will help guide you on your journey
through machine learning, whether you’re working on new algorithms or in particular, whether you’re working on
real-world applications. So, coming back to all
of these algorithms, you might be a little
frustrated and say, “Why do I have to learn about hundreds or thousands
of different algorithms? Why can’t these machine
learning people just come up with the one
universal algorithm?” In fact, maybe they have; maybe it’s deep neural networks. “Deep neural networks will solve all of humanity’s problems. I don’t need to learn
about the rest.” Well, there’s a mathematical
theorem, it’s proven, so it’s unlikely to be retracted anytime soon. It’s called the ‘No Free Lunch’ theorem, due to David Wolpert back in 1996. It says that, averaged over all possible data-generating distributions (you can think of that as averaged over all the possible problems you could ever want to solve), every classification algorithm has the same error rate when classifying previously unobserved points. That means if an algorithm is particularly good
at one problem, it will be particularly
bad at some other problem. Put it another way,
there is no such thing as a universal machine
learning algorithm. That is not my personal opinion. It’s a mathematical theorem. Well, to put it
another way, the goal of machine learning is not to find the universal algorithm
because it doesn’t exist. But instead, to find
an algorithm that is in some sense well matched to the particular problem that
you’re trying to solve. Okay, so this is
very fundamental; it’s at the very heart of machine learning. So, machine learning, we
all know, depends on data. But we cannot learn
just from data. We need to combine data with
something else with a model or we can think of this as constraints or we can think
of it as prior knowledge. I’ll use the term
prior knowledge, but people call it lots
of different things. You cannot learn
from data alone. Otherwise, we’ll have a sort
of universal algorithm. So, we need to combine data with this prior knowledge in
order to make any progress. Now, we also know, not least from the recent developments in deep learning, that the more data
you have, the better. And in some sense, if you have lots of data, you can get away with
a little bit of prior knowledge. Or conversely, if you’re in a world where you have
very limited data, then you need to complement that with a lot of
prior knowledge, very strong assumptions about the problem you’re
trying to solve. Now, what’s interesting is the meaning of
this vertical axis. What do we mean
by a lot of data? So, this is a really
important point. I want to talk about big data and what we mean by the size of a data set because there are two completely
different meanings to the size of the data set. It is very important not
to get them confused. There’s the computational size, which is just how many bytes
does it take up on disk. And there’s
the statistical size, which relates to
its information content. So, we illustrate this with a couple of,
sort of corner cases. So, the first example, imagine we have
a block of metal. We apply a voltage and the current flows through
the block of metal, and we’re going to
measure how much current flows when we apply
a particular voltage. And we’ve got
seven measurements here. As we’ve applied
seven different voltages, we’ve measured
the corresponding values of current and our goal
is to generalize. This is a machine
learning problem, and so, our goal is to predict the
current for some new value of voltage on which we haven’t
yet made a measurement. Now, in this case some kind and friendly physicist has come along and told us about Ohm’s Law. Ohm’s Law just says that current is proportional to voltage. It’s a straight line
through the origin. The only thing we have
to learn is the slope. The data points I have shown
have measurement errors. These are real-world
measurements. They’re a little bit noisy. If they weren’t
noisy, one data point will determine
the slope exactly. But the data points
are noisy and there’s only a finite
number of them. And so, we don’t know
the slope exactly. But if we’ve got
seven measurements, and the noise is not too high, we can be pretty confident
about that slope. There’s not
very much uncertainty. This is a data set,
which is computationally small because it’s seven pairs
of floating point numbers. So, computationally,
it’s a tiny data set, but statistically, it’s
very large data set. In other words, if I gave you another million measurements
of currents and voltage, then your uncertainty
on the slope will get a little bit smaller, but it’s already very small. So, the next billion data points are not going to make a lot of difference. You’re already in the large data regime from a statistical point of view.
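Here is a minimal sketch of that idea, not from the talk: a Bayesian fit of Ohm's law through the origin, with a true slope and a Gaussian noise level that are made up purely for illustration. The point is that the posterior uncertainty on the slope is already tiny with seven points.

```python
# Minimal sketch (not from the talk): Bayesian fit of Ohm's law I = s * V
# through the origin, assuming Gaussian measurement noise with a known
# standard deviation. The true slope and noise level are made up.
import numpy as np

rng = np.random.default_rng(0)
true_slope, noise_std = 0.5, 0.05   # assumed values, for illustration only

def posterior_slope_std(n_points):
    """Std of the posterior over the slope, with a broad Gaussian prior."""
    V = rng.uniform(0.0, 10.0, size=n_points)                 # applied voltages
    I = true_slope * V + rng.normal(0, noise_std, n_points)   # noisy currents
    prior_precision = 1e-6                                    # very broad prior
    precision = prior_precision + (V @ V) / noise_std**2      # posterior precision
    return 1.0 / np.sqrt(precision)

for n in [7, 1_000, 1_000_000]:
    print(n, posterior_slope_std(n))
# The uncertainty is already tiny with 7 points; a million more barely helps:
# the data set is computationally small but statistically large.
```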
Now think about another corner case. Imagine we’re going to have some images, and I’m going to label
the images according to the object that they contain. So, it might be
airplane, car, and so on. And these images
might have millions of pixels that are occupying many megabytes
each on your disk. And we might have
a billion images of each class, a billion examples of airplanes and a billion examples
of bicycles. So, this is going to take up
a huge amount of disk space. So, this is a data set, which is computationally very large, big data, in the usual sense. But what about statistically? Well let’s imagine,
I’m naive and I just treat these images as vectors and feed them into my favorite, whatever neural network
as a classifier. If you think about the airplane, the airplane could be
anywhere in the image, that’s two degrees of freedom. Actually, it can be
any distance as well. So, three degrees of
freedom of translation, three degrees of
freedom of rotation. Your planes come in different
colors, different shapes, different illuminations,
but all of these degrees of freedom can be taken together
combinatorically. So, if I showed you
one image a second, and you’ll all agree that
every image was an airplane, how long before I
run out of images? Well the answer is, far longer than the age
of the universe. I mean the number of images
that we all agree are airplanes is vast compared to the number of
electrons in the universe. So, if you just have a very naive approach to
classifying these objects, then even a billion images of each class is a
tiny, tiny data set. So, it is computationally
large data set that is statistically small. I’ll just go back for a second
to the previous picture. This refers not to the
computational size of the data, but to the statistical size. So, that’s the concept of
prior knowledge in the data and the concept of
the size of the data set. So, coming back to
this problem then of which algorithm
am I going to use? How am I going to address
this problem of just thousands, of thousands of
different algorithms? So, I want to introduce
you to the philosophy, if you like, with
Model-Based Machine Learning. But it’s a very
practical philosophy. So, the idea of this is not to have to learn
every algorithm there is. It is not to try out every algorithm and empirically
see which works best. The dream of Model-Based
Machine Learning is instead to derive the appropriate machine learning algorithm
for your problem. Essentially, by making
this prior knowledge explicit, which I will show you how
that works in a minute. So, traditionally we say,
How do I map my problem onto one of the
standard algorithms? And often, that’s not
clear and so, typically, people will try out lots
of different things, they try decision
trees and nets, and small vector
machines, and so on. Instead, in the
model-based view, we say, What is the model that
represents my problem? What is the model that
captures my prior knowledge? And so, by forcing
ourselves to make these prior
assumptions explicit, we have a compass to guide us to the correct algorithm or
these sets of algorithms. So, the idea is
the Machine Learning Algorithm is no longer
the first class citizen. Instead, it’s the model. It’s the set of assumptions, and there are set of
assumptions that are specific to the problem
you’re trying to solve. So, if your problem,
you’ll have one set of a problem, set
of assumptions, you’ll have a different set
of assumptions, you will arrive at
different algorithms. And that’s why
there’s no such thing as a universal algorithm. The algorithm that it’s tuned to the particular problem
we’re trying to solve and that’s reflected
in this domain knowledge, these assumptions
as prior knowledge. So, we take the model, the prior knowledge, we combine it with an inference method. The inference methods
tend to be fairly generic. So, the inference methods are things like gradient descent, in the case of neural nets, or expectation propagation if we’re looking at graphical models. General techniques for
optimizing or computing the posterior distribution
of parameters of a model and together they define the machine
learning algorithm. So, the dream is, you write down explicitly
your assumptions. You choose an appropriate
inference method and then you derive
the machine learning algorithm. And when you apply
it to your problem, it will be wildly successful. So, that’s the dream. Now,
we’re not entirely there yet. But I’ll show you
some great examples. Let’s talk a little bit about the assumptions
that go into models. If you look at a deep neural net, you might think, well, they’re not
making any assumptions; they’re just generic, universal machine learning algorithms: you pour data in one end, and the magic comes out the other. So, where are the assumptions
in the neural net? So, let’s look at, if you like the simplest neural
net algorithm I suppose it’s
logistic regression. This is making a very,
very strong assumption. This is a lot of
prior knowledge. On that prior-knowledge axis it sits very high, because it’s restricting us to a very, very narrow domain. It’s making very
strong assumptions. It’s saying that
the prediction Y is some linear combination of the inputs passed through some simple nonlinearity.
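As a concrete picture of just how specific that assumption is, here is a minimal sketch, not from the talk, of logistic regression written out directly; the weights and the example input are made up for illustration.

```python
# Minimal sketch (not from the talk): logistic regression is just a linear
# combination of the inputs passed through a sigmoid nonlinearity.
# The weights and the example input are made up for illustration.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict(x, w, b):
    """P(y = 1 | x): one linear combination, one simple nonlinearity."""
    return sigmoid(w @ x + b)

x = np.array([0.2, -1.3, 0.7])   # input features
w = np.array([1.5, -0.4, 2.0])   # learned weights
b = -0.1                         # learned bias
print(predict(x, w, b))
# A single-layer net with several outputs is just many of these in parallel.
```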
That’s a very, very specific model, and if we have multiple outputs
at the same time, then we arrive at
a single layer, neural net. That’s like lots of logistic regressions happening
and all at the same time, and that of course
was the type of model that people were
excited about in the first wave of neural nets in the days of the
Perceptron and so on. The second wave of excitement of neural nets in
the late 1980s, early 1990s, was when back-propagation
came along, we learned to train
two-layer nets, in which these features themselves could be
learned from data, and it was a very exciting time. I actually made
a crazy decision, which was to abandon what was
a very successful career in physics because I’d read Geoff Hinton’s paper on
backprop and I thought, “Wow, machines that can learn
artificial intelligence. This is the future.”
And I gave up my career. I persuaded my boss
to buy me a computer. I taught myself to program, I had never done that before, got some C code and
started hacking away. So, that was the second phase of excitement
around neural nets. Then of course, they
went away again. They didn’t really go away,
they became rather niche. People moved on to other things. The support vector machines were very popular for quite a while. And then along came deep learning where we
learned how to train many, many layers, that’s
deep learning. By the way, the story
I heard from Geoff, I don’t think he’d mind me telling you this, is that he was rather fed up because he had discovered back-propagation with colleagues, and they had been
quite successful, but then they were
overshadowed by the support vector machines, which is kind of a funny sort of approach to machine
learning in a way. So, when he finally got
neural nets to work properly, he decided to call them
deep learning because that allowed him to call
support vector machines shallow. That’s the real reason. So, what prior knowledge is
built into this? Well, again, there’s
a lot of prior knowledge. It says the output
is determined by this hierarchy of processing. So, let’s take an example. Let’s imagine I’m going
to take a photograph. I’m going to classify that image
as either happy or sad. Now, what does the computer see? The computer sees pixels. So, how does a deep neural network solve it? Well, in the deep neural net,
the first layer, what it’s doing is looking
for things like contrast, dark regions next
to light regions, and the next layer combines those local contrast detectors to detect rows of pixels in the image where you have an edge, a dark region separated from a light region. Maybe the next layer looks at where edges end or where
they change direction. So, it looks for
things like corners, and a little bit further up, the corners get combined
together to make shapes, things like faces, perhaps expressions on faces, objects that you
see in the image. Maybe the next layer up, it’s looking at the
relationships between objects. Maybe there’s a birthday cake, maybe there are candles,
maybe there are people, maybe the people have
smiles on their faces, maybe at this point,
you’ve got a lot of evidence that this
is a happy image. Our brains are like that too. They have this layer
of processing. They have centre-surround responses, oriented edge detectors, and so on. And when we train
artificial systems, we find similar structures
in the layers of visual processing that
we find in the brain. So, there’s one very strong piece of prior
knowledge built-in, which is this
hierarchical processing that seems to be very effective. So, what’s really
going on, the reason deep learning is
working so effectively, in a way of saying this, is that there are lots of problems in the world including
image processing example I just gave you, where this hierarchical
structure seems to work well on real applications,
or put it another way, the prior knowledge that
builds into these deep networks resonates well with the kinds
of problems we’re trying to solve using these networks. Something to say a little bit about the data and
prior knowledge, and we look at some of the other assumptions that
are built into neural nets. So, let’s imagine now that
I’ve got a set of images. My goal is to classify the images according
to whether they contain a person or they don’t. So, here’s an image, and this image does
contain a person, and what we know is
that that classification does not depend on where in the image the person is located. So, these are all examples of images that contain a person. Now, in terms of
the vector of pixels, they’re all very
different, but they all belong to this class. If I want to build
a system that can detect a person irrespective of where the person
is in the image, then one way to do it
is to go and collect huge numbers of images with people in
all possible locations, and then the system will learn they’re all examples of people. The challenge there of course is a bit like that airplane, this very high dimensional
space, the airplane example. I need many, many examples, lots of examples of
images just to capture this notion that
the classification doesn’t depend on location. So, a very sort of
wasteful of data. Another way of doing it is
to generate synthetic data. So, maybe I don’t have data of people in lots of locations, but maybe I’ve
got just one image of a person in one location. I can create
synthetic data in which the person is moved around
into different positions. So, that’s another way of
building prior knowledge, not building it into the model, but effectively
augmenting the data, and that’s quite commonly used. Again, that was quite
wasteful because I have to replicate
the datasets, we end up with a
computationally large dataset. It would be much smarter if we could just give
it one example of a person and then in the model, bake into it the prior knowledge that the output doesn’t
depend upon location. We call that
translation invariance. The way we do that
in neural nets is through convolutional
neural networks. So, this is the input image, and we have
a convolutional layer. In the convolutional layer, each node looks at a small patch of the image. The node next to it looks at the next small patch, and the weight between the blue node and the red node is shared. So, they adapt during training so that they’re
always in lockstep. So, whatever this blue node
learns to detect, the red node will detect
exactly the same thing but moved slightly because that’s
the convolutional layer. That’s followed by
sub-sampling layer. Again, this node looks at a small patch on
the convolutional layer, and it might do something
like take the max. So, imagine there’s something in this image which causes
the blue node to respond, and that causes
this node to respond. Now, we move it slightly. Now, instead the red node
will respond, but again, because we’re
doing something like a max, this will still respond. So, this now exhibits
translation invariance. It responds even though the image just moved slightly.
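Here is a minimal sketch of that mechanism, not from the talk: a 1-D toy rather than a real image, with a made-up filter and signal, showing that a shared convolutional filter followed by max pooling gives a response that does not change when the input shifts slightly.

```python
# Minimal sketch (not from the talk): a 1-D toy showing how a shared
# convolutional filter followed by max pooling gives local translation
# invariance. The filter and signal values are made up for illustration.
import numpy as np

def conv1d(signal, kernel):
    """Valid convolution: the same (shared) weights slide over every patch."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(len(signal) - k + 1)])

def max_pool(features, width=4):
    return np.array([features[i:i + width].max()
                     for i in range(0, len(features) - width + 1, width)])

kernel = np.array([-1.0, 2.0, -1.0])          # a little 'edge' detector
signal = np.zeros(16); signal[6] = 1.0        # a blip in one position
shifted = np.zeros(16); shifted[7] = 1.0      # the same blip, moved by one

print(max_pool(conv1d(signal, kernel)))
print(max_pool(conv1d(shifted, kernel)))      # pooled outputs match:
# the detector still fires even though the input has shifted slightly.
```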
Now, what we do in practice is we repeat this many times: we alternate with another convolutional layer, another sub-sampling layer,
and it’s sub-sampling because the resolution of
this is lower than that. Eventually, when we
get to the output, we have a few outputs, and we have
translation invariance. So, if we moved things around,
the output stays the same. This actually encodes a sort of more general kind of
translation invariance because imagine part of the input is translated and
not the other part. Again, the output
will be invariant. So, it’s exhibiting sort of
local translation invariance. Think of a rubber sheet deformation. Imagine I’ve got
that birthday party, and some of the people
moved around and the birthday cake stays where it is, it’s
still a happy scene. So, we’ve got sort of local as well as global
translation invariance. You can see this is not
a universal black box. This has got a lot of strong prior knowledge baked into the structure
of the network. And if you don’t have
those sorts of structures, good luck with classifying airplanes and
all the rest of it. You’re back in that
exponential space again.>>Okay. So, to summarize where we’ve got to so far: we’ve talked about the fact that there isn’t a universal machine learning algorithm. The goal is to
find an algorithm that’s good on the particular
dataset that we have. That depends upon combining the data with prior knowledge. And the dream is that by being explicit about
the prior knowledge, combining with
an inference algorithm, we’ll discover the machine
learning algorithm, instead of having to read 50,000 papers, implement them all, and compare them. I want to introduce another concept now
in machine learning. So, machine learning, as
you know, this particular, the sort of breakthrough of
deep learning has generated this tremendous hype and excitement around
artificial intelligence. Now, artificial intelligence, the aspiration goes back certainly to Alan Turing
seven decades ago. And the goal is to
produce machines that have all of the
cognitive capabilities of the human brain. It’s a great aspiration, and we’re a very long way
from achieving it. We’ve taken
a tiny step towards it with the recent developments
of machine learning. So does that mean that all this hype about
artificial intelligence, all the excitement and the billions of dollars
investment is all a waste of time because it’s all decades
and decades away? In my view, no. In my view, all of the excitement
around machine learning is totally justified but not because we’re on the brink of
artificial intelligence; we may be, we may not be. Maybe it’s centuries
away, or maybe it’s next year, I have no idea, but there is
something happening, which is revolutionary.
It’s transformational. And it’s the transformation in the way we create software, and we’re not really talking about the development process. I’m talking about
the fact that ever since Ada Lovelace programmed the Analytical Engine
for Charles Babbage, she had to specify
exactly what every brass gear wheel did step by step. And software developers
do the same thing today. It’s a cottage industry
in which the developer tells the computer exactly what
to do step by step. Now, today, of course,
the developer doesn’t have to program
every transistor. They’ll call some API which invokes a million lines of code written by other developers, and there are compilers generating machine code
and all the rest. So they’re very effective, very productive compared to Ada Lovelace in terms of their efficiency,
their productivity. But, fundamentally,
we’re still telling the computer how to solve
the problems step by step. Now, machine learning, we’re doing something
radically different. Instead, we’re
programming the computer to learn from experience, and then we’re
training it with data. The software we write
is totally different. The software we write often
has a lot of commonalities. So we’d use neural nets to solve speech
recognition problems, communication
problems, and so on, adapted each time, according to the prior knowledge of
our domain, of course. But we’re doing something
radically different. I think this is a transformation in the nature of software, which is every bit as profound as the development
of photolithography. Photolithography was
a singular moment in the history of hardware. Ever since the days
of Charles Babbage and gear wheels, vacuum tubes, transistors, logic gates, computer hardware has been
getting faster and cheaper. And then we discovered
how to print large scale integrated circuits
using photolithography. And with that,
a transformation because it went exponential.
That’s Moore’s Law. We learned to print circuits, and now the number of transistors on a circuit doubles every 18 months. And as they get
smaller, they get faster. Amazing things happen. That’s why we have
the tech industry we have today. It’s why we’re all carrying supercomputers around in our pockets: because of photolithography,
because of Moore’s Law. Something interesting
may be happening in software because
the way we’re creating these solutions is by programming the computer to learn from experience and then training it using data. We now see a Moore’s Law of data: the amount of data in
the world is doubling every maybe year or two. And so we are on
the brink of something tremendously exciting and all pervasive through
machine learning. That’s real, that’s
happening right now. One of the things
that it might lead to is artificial intelligence. But even if it
doesn’t, or even if artificial intelligence
is decades away, this is going to transform
every aspect of our lives. One of the areas that I’m
hoping it’ll transform is health care and that’s
a personal interest of mine, but it will be all pervasive. And I do think it’s
transformational. I’ve got the yin and yang
diagram because I think there’s a kind of flipside of
learning from data, which is quantifying
uncertainty. So, again, go back to
traditional computer science. It’s all about logic. It’s all about zeroes and
ones. Everything is binary. The engineers at Intel
and ARM work really hard to make sure every transistor is unambiguously on or off. But in the world of learning from data, we’re in the world
of uncertainty. We have to deal with ambiguity, so uncertainty is everywhere. Which movie does
the user want to watch? Which word did they write? What did they say? Which web page are
they trying to find? Which link will they click on? Which gesture are they making? What’s the prognosis for
this patient? And so on. In all cases, we never
have a definitive answer, whatever certain, which link
the users going to click on. But they may be
much more likely to click on one link than another, and we can compute
that likelihood using machine learning. Uncertainty is also a heart
of machine learning. So there’s a transformation from logic to thinking
about uncertainty. Of course, you all know there’s a calculus of uncertainty, which is probability. Again, there are mathematical theorems which show that if
you’re a rational person and you quantify uncertainty, you will do so
using the rules of probability or something that’s mathematically
equivalent to them. So, again, that’s
a mathematical foundation that’s laid a long time ago. That’s not going to change. This we’re thinking
about just very briefly two perspectives
on probability. What do we mean by probability? Well, when we’re in
school, we usually learn a little bit
about probabilities. We learn the frequentist view: the limit of an infinite number of trials, a frequency interpretation
of probability. But I’m sure many
of you know there’s a much broader
interpretation which is probability is a
quantification of uncertainty, and that’s the
Bayesian perspective. It’s almost unfortunate that
both are called probability, but the mathematical
discovery is that if you quantify uncertainty
using real numbers, those numbers behave
exactly the same way as the frequencies with
which dice throws behave. And so we called it probability. The fact we use
the same name for both has, I think, led to a lot of confusion over the years. Let me just give you
a little example. Hopefully, this will
shed some light on this. So imagine we’ve got a coin,
and the coin is bent. The coin is not equally likely to land one side up or the other. Well, imagine, if
I flip the coin, there is a 60
percent probability it will land concave side up, and a 40 percent probability it will land concave side down. Let’s just imagine
that’s the physics of this particular bent coin. What do we mean by
60 percent probability? We mean if we flip it
many times and compute the fraction of times that
lands concave side up, as we go to the limit of an infinite number of
trials, that fraction, which will be a sort of noisy thing, will settle down and asymptote to some number, and that number will be 0.6. That’s the frequentist view
of probabilities. Now, let’s suppose that one side of this coin is heads,
the other side is tails. But imagine you don’t
know which it is. All you know is that
the coin is bent, and there’s a 60
percent probability of landing concave side up. Okay. So, Victor,
I’m going to make a big bet with you, a thousand dollars, about whether the next coin flip is going to be heads or tails. Now, you’re a very rational
and very intelligent person. How are you going to bet? You’re going to bet 50-50. It’s sort of obvious,
right? It’s symmetry. Victor doesn’t believe that if we repeat
the experiment many, many times, that half the time, it will be heads up
and half the time, it will be heads down. What he believes
is that it could either be 60 percent heads, or it will be 40 percent heads. You see, we are flipping
the same coin each time, but we don’t know which it is. So the frequency with which
it lands concave side up, it’s like a frequentist
probability, but uncertainty about whether the next coin flip
is going to be heads or tails is like
a Bayesian probability. And so imagine I’ve
got this bent coin behind the desk here, and I’m flipping the coin. And I’m honest and truthful, and I’m telling you whether
it’s heads or tails. The more data you collect, the more you can discover about whether heads is on the concave side or heads
is on the convex side. As you collect data, your uncertainty about whether heads is on the concave or the convex side gradually reduces. And then, in the limit of
the infinite number of trials, there’s no uncertainty
left at all, you’re completely certain
about which is concave and whether the heads is on the concave side
or the convex side. You still don’t know
whether the next coin flip is going to be heads or tails. But let’s say you’ve become certain that heads is on the concave side; then you know there is a 60 percent probability that the next flip will be heads. I hope that illustrates
the difference between Bayesian and
frequentist probabilities. That’s the simplest example I can think of.
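A minimal sketch, not from the talk, of that belief updating: we assume the concave side comes up with probability 0.6, start 50-50 on which side heads is on, and apply Bayes' theorem after each flip that is reported.

```python
# Minimal sketch (not from the talk): Bayesian updating for the bent coin.
# Physics (frequentist part): the concave side lands up with probability 0.6.
# Belief (Bayesian part): start 50-50 on whether heads is on the concave side.
import numpy as np

rng = np.random.default_rng(1)
p_concave_up = 0.6
heads_is_concave = True        # the hidden truth, unknown to the bettor

p_heads_on_concave = 0.5       # prior belief
for _ in range(100):
    concave_up = rng.random() < p_concave_up
    heads = concave_up if heads_is_concave else not concave_up  # what gets reported
    # Likelihood of seeing this outcome under each hypothesis:
    like_concave = p_concave_up if heads else 1 - p_concave_up
    like_convex = 1 - like_concave
    # Bayes' theorem:
    numer = like_concave * p_heads_on_concave
    p_heads_on_concave = numer / (numer + like_convex * (1 - p_heads_on_concave))

print(p_heads_on_concave)   # creeps towards 1 as the flips accumulate
# Even when this belief is certain, the next flip is still only 60% heads.
```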
At this point, you might be thinking: why am I making so much fuss about this? Because I’ve said that
in traditional computing, everything is zero or one. And now everything is
going to be described by probabilities which lie
between zero and one, and it seems like a tiny change. It seems like
just a little tweak. So, this is an example, this is my illustration of
why it’s not a little tweak, why it’s a profound difference. So imagine, here’s a bus. And let’s suppose the bus
is longer than the car. And we’ll suppose that the car is longer than the bicycle. Okay. Now again, I know
you’re all smart people. So, if I say, the bus
is longer than the car and the car is
longer than the bicycle, do you all agree that the bus must be longer than the bicycle? Okay. If anybody doesn’t agree, go back to the beginning
of the class or something. That’s a very well
known property. We call it transitivity. And here’s the amazing thing. When we go to the world of
probabilities and uncertainty, transitivity need
no longer apply. And there’s a really
simple example of it. And it’s these things. These are called Efron
dice or nontransitive dice. They’re standard dice, except that they have unusual choices of numbers. And let’s say, again, we’re determined to get
some money out of Victor, so, I’m going to make a bet, that we’re gonna
have a game of dice. So, we’re going to roll the dice 11 times, an odd number, and whoever gets the greater number of wins is going to get the money. Well, it turns out
that the orange die will beat the red die, two-thirds of the time. So, two-thirds of the time, the orange number will be bigger than the red
number. Big deal. If I play the orange
against blue, two-thirds of the time, blue will give
a bigger number than orange. two-thirds of the time, green will give
a bigger number than blue. And now, here’s
the amazing thing. The bicycle is also
longer than the bus, because two-thirds of the time, red will give a bigger number than green. Now, if that isn’t
counter-intuitive, I don’t know what is.
It’s bizarre, right? It’s extraordinary and it’s just a consequence of the fact that these are
uncertain numbers, they’re stochastic numbers. And the way it works,
it’s actually very simple. So, these are the numbers
on the different die. So, the orange one actually always rolls a
three as it happens. On the red one,
two-thirds of the time, you get a two, and one-third of
the time, you get a six. So, it’s obvious that in
two-thirds of the time, orange gives you
a bigger number than red. And I’ll leave it as an exercise for you to check the others.
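Here is a minimal simulation sketch, not from the talk, that you can use to check the cycle. Only the orange and red faces were given explicitly in the talk; the blue and green faces below are a standard Efron-style completion that is consistent with those two, so treat them as an assumption.

```python
# Minimal sketch (not from the talk): simulate the nontransitive dice.
# Orange and red faces are as stated in the talk; blue and green are an
# assumed, standard Efron-style completion, used here for illustration.
import itertools

dice = {
    "orange": [3, 3, 3, 3, 3, 3],
    "red":    [2, 2, 2, 2, 6, 6],
    "blue":   [0, 0, 4, 4, 4, 4],   # assumed faces
    "green":  [1, 1, 1, 5, 5, 5],   # assumed faces
}

def p_beats(a, b):
    """Exact probability that die a rolls strictly higher than die b."""
    wins = sum(x > y for x, y in itertools.product(dice[a], dice[b]))
    return wins / 36

for a, b in [("orange", "red"), ("blue", "orange"),
             ("green", "blue"), ("red", "green")]:
    print(f"{a} beats {b} with probability {p_beats(a, b):.3f}")
# Each line prints 0.667: the cycle closes, so transitivity fails.
```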
So, occasionally, in my copious spare time, I sometimes go and give talks
in schools that sort of try and inspire the next generation with excitement of
Machine Learning, Artificial Intelligence,
Computer Science. We actually hand out packs
of these dice to the kids. And if you go to that link, you can actually
read a little bit more about it and you can see those numbers and check
for yourself this is real. So again, I think this
is quite a profound shift from the world of logic
and determinism to, if you like, the real-world
of uncertainty. At this point, I was going
to show a demo and sadly, I can’t show you the demo. So, in fact, I’m just
going to skip over this. The demo was simply
an example of Machine Learning in
operation where the machine learns about my
preferences for movies. And it actually does
so in real-time. So, as I rate movies
as like or dislike, its uncertainty about which movies I like gradually reduces. So, what you’re seeing in the demo is really, if
you like the modern view, I like to call it the modern
view of machine learning, not machine learning
as tuning up parameters by
some optimization process, but instead, machine
learning in the sense that the machine has
a model of the world. In this case,
a very simple world, the world of movies that
I like or don’t like, it has uncertainty about the world, expressed
as probabilities. And as it collects data, that uncertainty reduces,
because it’s learned something, rather like
the coin flip example. And we can think of all of machine learning from
that perspective. What I’m going to do now
is give you a tutorial in about one slide on a favorite subject of mine, Probabilistic Graphical models. Because I’m going
to show you how we’re taking steps towards realizing that dream of
model-based Machine Learning. Not just as a philosophy
of Machine Learning, not just as a compass to guide you through
this complex space, but even as a practical tool that we can use in
real-world applications. And to do this, I’m just
going to need to give you a very quick tutorial
on graphical models. If you know about
graphical models already, this will be very boring, and if you don’t know about
graphical models already, you’re not going to learn very much, but at least you’ll
get a sense of it. So imagine, I’ve got two boxes, one of them is green,
one of them is blue. And I’m going to pick one
of these boxes at random, but not necessarily with
a 50, 50 probability. It might be 60, 40 or something. And then we’re going
to describe that by a graphical notation. In this graphical notation, I have a circle representing this uncertain quantity: the variable ‘jar’. So jar is a binary variable that’s either green or blue, but it’s not a regular variable. It’s not definitely green or definitely blue; it has a probability
of being green or blue, it’s an uncertain variable. And this little box just
describes that probability. Now, imagine, that the boxes contain cookies,
biscuits, as we say. These biscuits are either
circular or triangular. And the proportion of biscuits
is different in each box. So, I can now say,
supposing I go to the green box, the green jar, and I pull out a cookie
without looking, then there’s a one-third
probability that it’ll be triangular and two-thirds
that it will be circular. If I go to the blue jar instead, there’s a one-third probability it will be circular and two-thirds
it will be triangular. Okay? So again, there’s
some uncertainty. If I draw a cookie
out of the jar, we’re uncertain
about which it is. But we know something,
we know this probability. And so, cookie, again, is an uncertain variable that’s either
triangle or circle. It has some probability, but the value of
that probability, depends upon the value of
this random variable jar. So, we can think of
this model in what we call a generative way in which
I do an experiment, I, first of all, randomly, choose a jar, and
then given that jar, I dip in and I randomly
choose a cookie, and that tells me the value of jar and consequently
the value of cookie. That’s a forward model
and that generates data, generates jars, it
generates cookies. And I could repeat
that many times. Now, in real
applications, typically, in this graph, of course, is describing my prior knowledge
about the world. I know the world
consists of jars and it consists of cookies and they relate to each other
in certain ways. So this graph is
a very visual way of expressing that prior knowledge which is obviously critical in as we’ve
seen in Machine Learning. Typically, what
we do though with these graphs is we
observe something, in this case, you
might observe cookie. Or we want to go the other way, we want to work out which jar
did that cookie come from. So, maybe there’s a 60 percent
chance that it’s green. So, it’s more
likely to be green. But now, when I
observe the cookie, I observe that
the cookie is triangular. Now, your intuition
says, if it’s triangular, it’s more likely that
it came from blue than green. And that’s correct. So when you run the math, you just apply Bayes’ theorem, and it’s very simple: you’ll find that the probability over the jar shifts a little bit towards blue, just as your intuition would expect.
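A minimal sketch, not from the talk, of exactly that calculation, using the probabilities quoted above: a 60 percent prior on the green jar, and triangle probabilities of one-third from green and two-thirds from blue.

```python
# Minimal sketch (not from the talk): Bayes' theorem for the jar-and-cookie
# model, using the probabilities quoted above.
p_green = 0.6                                   # prior over the jar
p_triangle = {"green": 1/3, "blue": 2/3}        # likelihood of a triangle

# We observe a triangular cookie; which jar did it come from?
evidence = p_triangle["green"] * p_green + p_triangle["blue"] * (1 - p_green)
p_green_given_triangle = p_triangle["green"] * p_green / evidence

print(p_green_given_triangle)   # about 0.43: belief has shifted towards blue
```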
And so that, if you like, is the Machine Learning process. We’ve observed that I like a particular movie, and the internal state of
the machine gets updated, using a sort of Bayes’ theorem on steroids, to say that this user is a bit more likely to like action adventure than romantic comedy or
whatever it might be. And that’s a crash tutorial, but Chapter eight of
this amazing book, I’m sure you will have, I hope. Chapter eight is
a free PDF download and that’s a whole chapter
on graphical models. Okay. So, let me illustrate now. I’m going to pick a particular Machine
Learning algorithm, it’s called PCA or Principal
Components Analysis, something everybody learns
about in Machine Learning 101. And first of all, we’re
going to describe PCA the way you’d normally learn
about it from a textbook. And then, I’m going to
show you how to derive PCA using the
model-based perspective, and we’ll use
those graphical models. So, PCA as an algorithm, it’s like a recipe. It’s a recipe that
you apply to data. First of all, it
says, take the data. So, the data will be vectors in some high dimensional space, and there are N of them. It says: first, average those vectors to compute the mean; then subtract the mean from all of those vectors and compute the sample covariance matrix; then find the eigenvalues and eigenvectors of the sample covariance matrix; and keep the eigenvectors corresponding to the M largest eigenvalues. That, in some sense, has compressed the data, or projected it down onto an M-dimensional subspace in a way that preserves variance.
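That recipe, written out as a minimal sketch, not from the talk, with made-up random data purely for illustration:

```python
# Minimal sketch (not from the talk): PCA as a recipe.
# The data here are random numbers, purely for illustration.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))        # N = 200 data vectors in 5 dimensions
M = 2                                # dimensionality of the subspace to keep

mean = X.mean(axis=0)                # 1. average the vectors
centred = X - mean                   # 2. subtract the mean
cov = centred.T @ centred / len(X)   # 3. sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)            # 4. eigen-decomposition
top = eigvecs[:, np.argsort(eigvals)[::-1][:M]]   # 5. M largest eigenvalues
Z = centred @ top                    # project down to M dimensions
print(Z.shape)                       # (200, 2)
```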
So, that’s Principal Components as a recipe: you can code that up, turn the handle, and out would come the answer, and you’d have no idea why you picked that recipe. Maybe it works brilliantly. But what if it doesn’t work well?
it works better? So, if you have no compass, you’re just left with
random trial and error. So, here’s a much better way
of thinking about things. So this is PCA
viewed as a model. So, in the same way that
we’re going to pick a jar, and then choose
a cookie from the jar, I’m going to describe to you
how to generate the data. Because one way of capturing your prior knowledge is to write down how the data
gets generated. So, in this case, it
says pick a vector from a lower
dimensional subspace, from a Gaussian
distribution having zero mean unit variance,
circular Gaussian distribution. So, pick a vector from
that Gaussian distribution, then project it into the high dimensional space of your data with some linear transformation, and then finally generate a data point by taking that projected point, making it the centre of another Gaussian distribution, one that represents the noise, and picking a sample from that. And so, don’t worry
about the details. It’s just a description of how to choose one of the jars, reach in, and
then pick a cookie. So choose the low
dimensional vector and then generate the high dimensional vector
by adding noise. Another little notation,
it’s called a Plate. It says just repeat
that process n times. So, it says, put the cookie back in the jar, give
it a good shake, close your eyes
again, pick a jar, pick a cookie from the jar, and do it n times.
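A minimal sketch, not from the talk, of that generative process for the probabilistic view of PCA; the dimensions and parameter values are made up for illustration.

```python
# Minimal sketch (not from the talk): the generative view of PCA
# (probabilistic PCA). Dimensions and parameters are made up.
import numpy as np

rng = np.random.default_rng(3)
M, D, n = 2, 5, 200                 # latent dim, data dim, number of points
W = rng.normal(size=(D, M))         # linear map from latent space to data space
mu = np.zeros(D)                    # data mean
noise_std = 0.1                     # observation noise

# Plate: repeat the same little generative story n times.
Z = rng.normal(size=(n, M))                              # zero-mean, unit-variance latents
X = Z @ W.T + mu + noise_std * rng.normal(size=(n, D))   # project, then add noise

print(X.shape)   # (200, 5): data generated from the model's assumptions
# Machine learning runs this backwards: infer Z and W from the observed X.
```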
So, that generative process, which describes how the data gets generated, is a great way to express
our prior knowledge. But when we do machine learning, we’re trying to solve
an inverse problem. We have to go back the other
way, which is much harder. So, we observe the data, and we have to make inferences about the points in
the lower dimensional space, and also the values
of the parameters of this linear transformation. And so, we have
to run inference. And then again, it’s
a mathematical proof that this is identical. If you use what’s called maximum likelihood to do the inference, that is to say if you choose all the parameters to maximize the probability of the data under the model, you exactly recover PCA. Now, at this point, you might think, ‘ah, that’s a lot of work just to get back to PCA.’ The two are completely equivalent. So, why is the model-based view
so much better? The reason is that if this doesn’t do what
you wanted to do, you can go back and
examine those assumptions. And you can change
the assumptions to better reflect the problem
you’re trying to solve, and then, rederive the model. You haven’t just got a recipe, you’ve got a procedure for arriving at the best model
for your problem. So, just take a simple example: suppose that these generated data points are not generated independently. So, for example, let’s imagine
I’m air traffic control, and I want to know
where the aeroplane is. The aeroplane is
flying across the sky. And once a second, my radar is going to send out some energy. It’s going to bounce
off the aeroplane, come back and I receive it, and I make a measurement
of where the aeroplane is. Now, the problem is that
that measurement is noisy. So, if I just make
a single measurement, I’ll know where
the aeroplane is roughly, but there’ll be
some uncertainty. Now, we know that if
that’s just random noise, if I make multiple measurements, I can sort of average
out the noise, and get a more certain estimate of where that aeroplane is. So, we’re going to make several measurements. The problem is,
the aeroplane is moving. As I make these
measurements, it’s moving. If I just average
the measurements, that will be great, because
I’ll average out the noise. But I’ll also average
out the location, which is what I’m trying to
find. So that’s bad news. If I don’t average, if I just
use the latest measurement, I won’t be averaging
over the motion, but I have a lot of noise.
So what should I do? Well, you could sort
of have some intuition. You could say, “Hmm, I should take
the latest measurement because that’s where
the aeroplane is, but I’ll add in a bit of
the previous measurements to get rid of some of the noise, maybe a little bit of the measurement before that.” But the measurement from
10 minutes ago is irrelevant. So, have some sort of
a weighted average, or I give more weight to
the more recent measurements. That’s sort of your intuition. Actually, that intuition
turns out to be good. That’s actually
what you should do. But how much weight
should you give? What sort of functions should
you use for this decay? How much should you decay by? How do you know what
to do? You’re back in the world of recipes, intuition, trial, and error. So instead of that, let’s build a model, in
which we are very explicit, about all the assumptions
we’re going to make, because that’s more
likely to work better. And if it doesn’t, we know how to change things to improve it. So, we’re going to say: this is the actual position of the aeroplane in space. It’s the thing we want to know. We don’t know it; it’s unknown. So, the aeroplane is in some position. And then we make a measurement, and the measurement is noisy; this is the noise process. And this is the observed position, whose value we do know: that’s the thing we observe, a noisy measurement of the true position. Given that alone, we could estimate this but
have a lot of uncertainty. What’s going to happen
now is the aeroplane is moving across the sky. We could build a model for that. And the simplest model
that we can have is to assume that the uncertainty in the position of the aeroplane is Gaussian, that the measurement
noise is Gaussian, and that the movement
of the aeroplane across the sky is described
by a linear model. So, given its position
and its velocity, we can compute where it will
be at the next timestep. Now, again, we make
another measurement, another noisy measurement
of that next timestep. Now, the aeroplane moves a little bit further and we make another measurement, and so on. So, that’s the
generative process. But now, what we need to
do is to run inference. Given these observations,
we need to compute, we need to revise, the probabilities of these aeroplane locations. It’s essentially Bayes’ theorem again, but a more complicated version of Bayes’ theorem. And it turns out that
that problem can be solved in a very elegant way computationally by passing
messages around the graph. So, we don’t have
time to go into that. It’s the very
beautiful mathematical solution called message passing. It’s very generic.
But this thing turns out to have a name. It’s called the Kalman filter. It’s been around since
the 50s or whatever. It’s very standard stuff in electrical engineering. When I was writing
my 2006 textbook, I had a chapter on
these time series models, and I read several books called Kalman Filters, Introduction to Kalman Filters. I found them pretty impenetrable, and it is very complicated: many, many chapters before you finally get to all of this stuff. This is, by far, the simplest way of deriving the Kalman filter that I know: just derive message passing in its full generality, and apply it to
this linear Gaussian model. And you get the Kalman
filter equations, in which you say that the posterior probability of the position of the aeroplane at this time depends upon all of the measurements.
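Here is a minimal sketch, not from the talk, of the resulting predict-and-update equations for a 1-D version of the tracking problem: a constant-velocity linear-Gaussian model. The noise levels, time step, and measurements are made up for illustration.

```python
# Minimal sketch (not from the talk): a 1-D constant-velocity Kalman filter.
# State = [position, velocity]; noise levels and timestep are made up.
import numpy as np

dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])   # linear motion model
H = np.array([[1.0, 0.0]])              # we observe position only
Q = 0.01 * np.eye(2)                    # process (motion) noise covariance
R = np.array([[1.0]])                   # measurement noise covariance

def kalman_step(mean, cov, z):
    """One predict-then-update step, given a new noisy measurement z."""
    # Predict: push the belief forward through the motion model.
    mean = A @ mean
    cov = A @ cov @ A.T + Q
    # Update: weigh the new measurement against the prediction.
    S = H @ cov @ H.T + R                       # predicted measurement covariance
    K = cov @ H.T @ np.linalg.inv(S)            # Kalman gain
    mean = mean + (K @ (z - H @ mean)).ravel()
    cov = (np.eye(2) - K @ H) @ cov
    return mean, cov

mean, cov = np.zeros(2), 10.0 * np.eye(2)       # broad initial belief
for z in [np.array([1.1]), np.array([2.0]), np.array([2.9])]:
    mean, cov = kalman_step(mean, cov, z)
print(mean)   # roughly position 3, velocity 1: recent measurements count most
```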
It is more sensitive to the current and recent measurements, and so you do get that decay of the weighting of the evidence, but in a very precise way that you derive from
the mathematics. And you can even pass messages in the other direction and send information back in time, and get a better estimate of where the aeroplane was by making use of future measurements. Again, just as your intuition
would indicate. Guess what? If these are not Gaussian, but supposing they’re
discrete variables, again, you just pass messages back and forth. Now, it’s called the
hidden Markov model. Well, that’s a completely
different literature with completely different
notation and completely different but
equally impenetrable derivations of how all this goes. Again, it’s just
exactly the same model, just slightly
different assumptions. And maybe this works quite well, maybe it doesn’t work
quite well enough, maybe there is some issue. So, you try this
out on your problem, you find it’s still
not working quite well enough, you know what to do. It could be, maybe there’s a problem with the data
that you’ve collected, maybe there’s a problem
with the inference, because most inference
algorithms are approximate. For the Kalman filter,
it’s exact, but once you get to more complex models, you almost always need approximate inference. And maybe your inference
algorithm had some issues, or maybe your prior assumptions
were not correct. Maybe you need to refine them for the problem you’re
trying to solve. You know how to do that because you made them explicit. So, maybe this noise isn’t Gaussian; maybe with real radar it isn’t. So you go and talk to a radar engineer and find out what the noise really looks like, and
then, model it. And you get better results. Okay. So, I think it’s more
or less my final slide. What I’ve shown you so far
is really a philosophy, a viewpoint of machine learning
that I hope helps provide you with a compass
to guide you through this complex morass
of algorithms, but also a practical tool to use when you’re building
real world applications. But at the back of
our minds, we have a dream. And the dream is that we
can somehow automate this. We can provide tools so that people who haven’t read
all the textbooks on neural net, I mean, machine
learning and so on. You need to buy
the textbooks, by the way. You don’t need to
read them, just so everybody is
clear about that, or one in particular anyway. But instead of having to read all that stuff and learn all about this, can we automate it? Can we provide
tools that will help democratize this approach
to machine learning? And so, this is the dream. So, if you think about coding up inference for
a complicated problem, like the movie
recommender example, it’s pretty complicated stuff, thousands of lines of code. It’s written by machine learning experts, who know about the modeling,
know about inference, know how to code
up the inference in the context of those models. This is all complicated stuff. All written in C++ or whatever
your favorite language is, compiled down to machine code, combined with the data, lots of compute happens and you get your predictions
with uncertainty. What if instead, we could write a thing which we call
a probabilistic program. So, probabilistic program is just a very short piece of code written in
some appropriate language, which effectively describes what that probabilistic model, that graphical model describes. So, it will almost say pick one of the jars
with this probability, and then, for that jar, pick a cookie with
a certain probability; or the aeroplane is in
this position in the sky, and one second later, it’s
moved to a new position, and I’m going to
make measurements. The measurements have
Gaussian noise or something. It’s just a simple description
and a few lines of code, maybe if we’re lucky, tens of lines of code, that describes the generative
process of the data, or describes in a very clear, intuitive form the prior knowledge that we’re baking into our model. And then we have a piece of magic, which is a probabilistic program compiler, which is going to take this high level description and generate those thousands of lines of code automatically. So that’s the dream. We
haven’t achieved the dream, but we have made
a lot of progress. We’ve built a compiler. And if you go to the Infer.NET site, you can download Infer.NET. And there’s lots of tutorials
and examples and so on. And infer.net doesn’t
cover every possible case, but it covers a lot
of common cases. And for those cases
to which it is applicable, you do have this automation. And of course,
the whole time, we are looking to extend it
and generalize it. So, it is quite an exciting programme of research. And so, we’re going to
leave you with this. This is the graphical model with the random variables,
the probabilities, and the plates for
the movie recommender problem, the problem of
recommending movies. So here, we have users. So, I’ll stand back so
I can read the writing. Okay. User bias feature weights. So, what we’ve got here are
features about the user. It might be age, gender, geographic location,
anything which might influence
what movies they like. Here, we’ve got
features of the items. So it might be
the duration of the movie, the actors, whatever,
perhaps genre, action, adventure,
romantic, comedy and so on. And then, we also have
in here information, which is we call
collaborative filtering. So that’s: people who’ve liked the movies you’ve liked so far also liked
these other movies, so perhaps you’ll like them. But it’s not coded up as some sort of hacky piece of intuition; it’s just described by a
probabilistic model, a very precise
probabilistic model. And so, this can be cast in a few dozen lines
of infer.net code. And then, the
inference algorithm can be compiled automatically. And so, right down
here, we have the thing we observe. That’s the ratings. That’s somebody saying I
like this movie or I don’t or this movie has five stars,
this is one star. Once we make observations from a user about
which movies they like, we send information, we
pass messages up this graph, revise probabilities for
these hidden variables, send messages back down again, and we get revised probabilities which are ratings for movies
the person hasn’t yet seen. And so we update the probability
I’m going to like some unseen movie based on ratings I’ve given
to movies I have seen, plus all the ratings
that thousands of other people have given to
that movie and other movies. That’s how that works. And again, that’s all coded up
in infer.net. And so we leave you
with another book, but the good news is this book
is online, it’s free. Will be forever more. It’s called Model-Based
Machine Learning. It’s co-authored with John Winn and John has actually done overwhelmingly the bulk
of the work on this book. So I’m very much
the second author. It’s really John’s baby. This is a very unusual book. There’s a little introductory
chapter, but thereafter, every chapter is
a real world case study. We’ve chosen examples from Microsoft because that’s
what we know about, and these are things
we’ve worked on. And in each case, we
start with the problem. The problem we’re
trying to solve, we’re trying to match different players on Xbox
so they’ll have a good game, in other words, they’ll be
similarly matched in strength. That’s the problem
we’re trying to solve. We’ll describe
the data that we have, we’ll describe
the prior knowledge, the assumptions
we’re going to make. We derive the Machine
Learning algorithm, we test it out on the data, we find it doesn’t
work very well, because that is what
happens in practice. Anybody who’s ever tried Machine Learning
in the real world, the first thing you try
generally doesn’t work. And then we go
back to debugging, was there a problem with
the data that we collected? Was there a problem
with the inference, the approximate inference
algorithms we used? Or was it a problem with
the assumptions that we made? And so we go and we revise
the assumptions and then run it again and of course every chapter has a happy ending. We get good results and it ships and is used by
millions of people. But it’s a little bit more honest about
the process by which we arrived at those solutions
and it shows you how, I hope, for each
of these examples, and they’re drawn from
very different domains, medical examples and so on, in each case, hopefully you can see how by making
the assumptions explicit, that critical prior knowledge, by making it explicit, it gives you a compass to
guide you through the process of revising and refining the solution and getting
it to work properly. Otherwise you’re left
with a big space of trial and error and not
knowing what to try next. So with that, thank
you very much.>>Thank you Chris
for that great talk. And we’ll probably take
a couple of questions and then, yeah, you have the mics? Okay, so this hand
going up first maybe one mic for the gentleman here.>>Hello sir, I’m [inaudible].
Thank you for the very nice talk. So I have one question that, by restricting to the class
of probabilistic models, are we losing something? What is your thought on
that? Because, there are neural nets which are
not probabilistic models.>>Yeah. Several
thoughts on that. First of all,
the probabilistic view of Machine Learning
is a general one. So the quantification
of uncertainty using probabilities is the only rational way to deal
with uncertainty. In practice, we often can’t deal with
probabilities exactly, we generally have to
make approximations. One extreme approximation
is a point estimate. So we replace some complicated distribution with
a single value. That single value will be
chosen in some way that might be maximum
likelihood for example. So if you’re taking a neural net
and you’re training by minimizing error which
is a log likelihood, a log [inaudible]
noise distribution, then you’re approximating
that probabilistic inference, maybe with a very
drastic approximation. And the bigger, the
more complex the model, the more data you have, the more performant you need to be, typically, the more radical the approximations you have to make in order to get something
that’s tractable and sufficient to
perform your application. So it is quite general.
Generally speaking though, although you may not be able to maintain full
probability distributions of all of the internal variables like you did in
the movie example, so all the internal weights
of the neural net. Nevertheless, the outputs almost invariably should
be probabilities. So I would say as a rule, whenever you’re
making predictions, they should always be
probabilistic predictions. One of the problems
of Support Vector Machines they’re just
intrinsically with no probabilistic and there are ways of fixing
it up afterwards. So when you make
a probabilistic prediction, instead of saying this person
has cancer or they don’t, you say there’s a 37 percent
chance they have cancer. First of all, you can threshold it and it’s back to a decision, but can do so much more. For example, maybe the cost of taking somebody
who has cancer, misdiagnosing as
not having cancer is much worse than
say somebody who’s healthy and diagnosing with cancer because in
the first case they might die in the second
case they might get upset and need
some further tests. So that loss measure is are very asymmetric
and if you’ve got probabilities up you can take that into
account correctly. You can use
probabilities to combine the outputs from
multiple systems so like a Euro of uncertainty or a universal currency you can
combine different systems. You can do things at
threshold you can say, I’m going to make
a decision when my confidence is
above a certain level, if my confidence is
below that level, I’m going to send off a human. So if you’ve got some very repetitive
task medical screening where people staring down
microscopes all day long, you might be able
to help them by just taking 90 percent of
the data and I’m very confident this is not
cancerous but everything else we’re going to leave
to human judgment. That’s a very practical
thing to do today. So lots and lots of
advantage of having probabilistic predictions
and no downside. It’s always, always
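To make the first point concrete, here is the standard textbook identity being alluded to (stated here by the editor, not quoted from the talk): if you assume each target is the model’s prediction plus zero-mean Gaussian noise with variance sigma squared, the negative log likelihood becomes a sum-of-squares error plus a constant,

```latex
-\ln p(\mathbf{t} \mid \mathbf{x}, w, \sigma^2)
  = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \bigl( t_n - y(x_n, w) \bigr)^2
  + \frac{N}{2} \ln\!\left( 2\pi\sigma^2 \right)
```

so minimizing squared error is exactly maximum likelihood under that assumed noise model, a point-estimate approximation to the full probabilistic inference.

And here is a minimal, hypothetical sketch of the decision-theoretic point about asymmetric losses and deferring to a human. The costs are invented purely for illustration; they are not taken from the talk or from any real screening system.

```python
# Illustrative only: turning a probabilistic prediction into a decision
# under an asymmetric loss, with a "defer to a human" option.

# Assumed, made-up costs: missing a cancer is far worse than a false alarm.
LOSS = {
    ("predict_healthy", "cancer"): 100.0,  # missed diagnosis
    ("predict_cancer", "healthy"): 8.0,    # unnecessary follow-up tests
    ("predict_cancer", "cancer"): 0.0,
    ("predict_healthy", "healthy"): 0.0,
}
DEFER_COST = 3.0  # cost of sending the case to a human expert

def decide(p_cancer):
    """Pick the action with the lowest expected loss given P(cancer)."""
    expected = {
        "predict_cancer": p_cancer * LOSS[("predict_cancer", "cancer")]
        + (1 - p_cancer) * LOSS[("predict_cancer", "healthy")],
        "predict_healthy": p_cancer * LOSS[("predict_healthy", "cancer")]
        + (1 - p_cancer) * LOSS[("predict_healthy", "healthy")],
        "defer_to_human": DEFER_COST,
    }
    return min(expected, key=expected.get)

for p in (0.01, 0.05, 0.40, 0.90):
    print(f"P(cancer)={p:.2f} -> {decide(p)}")
# Confidently "not cancerous" cases are screened out automatically;
# the ambiguous middle band goes to human judgment.
```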
>>Okay, Mausam has a question, then we’ll take one more. So, two more questions
and then we’ll wrap.>>I’m Mausam from IIT Delhi. Thank you for
the very nice talk. I learnt my AI in
the early 2000s, and that was the time
probabilistic graphical models
were at the peak, and I’m an applications
researcher, I work in Natural
Language processing and I remember conferences
where pretty much all the papers except a very few were all probabilistic
graphical models based; at some point
it became LDA-based and so on and so forth. Of course, there’s
a new world order, and so I find very few papers in the application area and I’m
not talking about people who work on the theory and the
fundamentals of Machine Learning. There’s a lot of work
still going on in there and some unsupervised
learning as well. But in the application domains, it is all neural networks
left right and centre and Probabilistic
Graphical models are either not being tried or have been overtaken, and life has changed. So I want to understand your perspective: in the future, in the time to come, what do you see as the role of Probabilistic
Graphical models based solutions in application areas. Do you believe that they will
still have a strong role to play or do you believe that they will be overtaken
by neural networks? If they will have a role to play would it be in
conjunction with neural networks? What is the value they will
offer, and when is a PGM solution the right
solution to approach a problem with?>>Sure. So you’ve got to understand that
Machine Learning like everything else is
a social enterprise. All right. So let’s
take neural networks. There was tremendous
excitement in the 1960s around perceptrons
because machines could learn and you could cut 10 percent of
the wires and it carried on working just not quite
as well just like the brain. So
tremendous excitement. Then all went away again,
and then all came back again in the 1980s, 1990s. Neural nets were
the solution to everything. Then it all went away again
and then it all came back again. So right now it’s all back. And
in the application domain, we’ve got these
particular techniques, certain classes of
convolutional nets and LSTMs, and a handful of things, which are working very well on certain
problems for which we can get lots of data we
can label up by hand. Many, many practical
applications. So it’s unsurprising that this tremendous focus of applications is bearing
down on this one set of techniques. We’ve discovered
this new technique and everyone applying it in all kinds of
places. That’s unsurprising. If you step back and look at the field of Machine Learning, it’s a very broad field, and this discriminative
training based on hand-labelled data
was one tiny corner, which has all kinds of limitations; I think the last
speaker covered some of these, and there are so many of them.
We are only scratching the surface of what we want to do
with machine learning. There is the whole world of reinforcement learning,
unsupervised learning, that [inaudible] and
somebody mentioned the work of [inaudible]
and others, and there are all the issues about
bias in learning. Think of the world of
Machine Learning as this enormous opportunity that’s out there in front of us, and then right now there’s
a whole bunch of people, for understandable
and good reasons, focused on one particular set of
techniques and applications. So first of all, probabilities
are the foundation, there’s a mathematical
theorem that says if you’re behaving rationally
and you’re not certain, you’re going to use
probabilities or something equivalent. So
it’s not going to go away. I don’t think the Maths is
going to change. The graphical models are
just a very beautiful notation. Personally, I find
a picture is worth a thousand equations and it’s just much easier to look
at a picture and see what it’s saying than
pages and pages of Maths. So I don’t think
the pictures are going to go away any time soon. But your question is really
about practical applications, and there are
so many applications. We’ve been working
on applications, and you’ll see examples in the book where just throwing a neural network at the problem is not
the right way to go, where actually
graphical models are the appropriate tool
and technique to use. So I can’t predict what
the next wave is going to be, maybe reinforcement
learning will dig in and get
some real traction, everyone will lurch
across and start applying reinforcement learning
to everything. But in terms of
the field of Machine Learning, what an amazing time to
be going into the field. We’re just at
the beginning of this. My son is at university
doing Computer Science. He’s interested in
Machine Learning. I think well, that’s great.
There’s a whole career to be built in this because we’re just at
the beginning of this.>>But just a tiny follow up. So you said that you tried neural networks in some
applications that didn’t work, I’m really happy to
hear that, but can you characterize in what kinds
of settings you expect neural networks
not to do well, where a PGM would be the solution,
in an unsupervised scenario?>>Sure. Just an
example would be the skill matching
example in Xbox. So again, it’s
a chapter in the book. What are your assumptions? You’ve got
some players and they have some skill and you have
some uncertainty in their skills, which we describe
by the simplest possible [inaudible],
the Gaussian distribution. And then they play
against each other, and you have some model
for how their performance varies because the stronger
player will sometimes lose to the weaker player
because they didn’t play too well in
that particular game. And that’s how we
model all of that. And in fact,
actually if you take that model and look at just
the maximum likelihood limit, where you throw away the uncertainty, you come up with
something called Elo, which is the standard method
used in chess worldwide. So, that’s a model which is appropriate to
that particular application. So again, it all comes down to the sort of
fundamental point of it all, which is that there isn’t such a thing as a universal algorithm. Again, there’s a mathematical
theorem that proves that, it’s about building
the right kind of solution that’s tailored
to your problem. So, you’ll see some examples
of that in the book.
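As a rough illustration of that model (a hand-simplified sketch, not the actual TrueSkill derivation or parameters from the book; the beta and k values below are invented), here is the point-estimate view in which skills are single numbers, performances are skills plus Gaussian noise, and the resulting Elo-style update nudges ratings towards the observed result:

```python
import math

BETA = 200.0  # assumed scale of per-game performance noise (illustrative)

def win_probability(skill_a, skill_b, beta=BETA):
    """P(A beats B) if each performance is skill plus N(0, beta^2) noise."""
    diff = skill_a - skill_b
    # The performance difference has standard deviation sqrt(2)*beta;
    # the extra sqrt(2) below converts the normal CDF into erf.
    return 0.5 * (1.0 + math.erf(diff / (2.0 * beta)))

def elo_style_update(skill_a, skill_b, a_won, k=32.0):
    """Point-estimate update: move both ratings towards the observed outcome."""
    p_a = win_probability(skill_a, skill_b)
    outcome = 1.0 if a_won else 0.0
    return skill_a + k * (outcome - p_a), skill_b - k * (outcome - p_a)

print(win_probability(1200.0, 1000.0))          # strong player usually, not always, wins
print(elo_style_update(1200.0, 1000.0, False))  # an upset pulls the two ratings together
```

The full model in the book additionally keeps a Gaussian uncertainty over each skill and updates it by approximate inference; the sketch above is only the throw-away-the-uncertainty limit that Chris mentions.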
>>Okay, we’ll take one last question here. Second row, yeah.>>Hi. This is [inaudible] from
Ministry of Technology, Delhi. And, I’m a PhD student there. And it is really
heartening to see you talking more about the Probabilistic
Graphical Models. I work in Probabilistic
Graphical Models, and at times, in these times, it becomes scary whether
I’m working in the right area when the whole world is
talking about Deep Learning, so it gives me
a sense of security that a person like you is propagating it.
Thanks for that. So, the question is, basically
so you shift the onus from the algorithms to the Probabilistic Models
and assumptions. But then, when you walk into the Probabilistic
Graphical Models another question arises
namely, how do you choose? I think the same problem
gets shifted to which
approximate inference algorithm you choose. If I work in
structured prediction, there is an ample range of approximate
inference techniques, from variational inference to MCMC. There is some
understanding on that, but I think the problem
has just shifted to what approximate
inference algorithm you will use for the
Probabilistic Graphical Model. That is the first question
and the second is at a higher level. So you talk about that there is no single algorithm as
such, and you have to adapt: you have to see the problem, understand the assumptions, and then see which algorithms work there. On the contrary philosophy, if I understand it correctly, Pedro Domingos talks about the Master Algorithm
which will work, I believe, for almost everything. So what are
your thoughts on that?>>Okay yeah. First of all
I just don’t want to give the impression that there are
Graphical Models over here and Neural Nets
over there, and you choose one or you
choose the other. Deep Learning is
the ability to train these deep hierarchical
layered structures and you might describe your
problem by a graphical model, but maybe one of
those conditional probabilities is
a Deep Neural Net. So these are not alternatives
they’re complementary. I think of the
probabilistic framework and the graphical model, again, as more like
a compass to guide your way around the world of
Machine Learning. Deep Learning is a very powerful technique and
it’s cropping up in many, many different places
it will be used a lot. So, I don’t want to characterize
them as alternatives, but I do like
the Graphical Models as sort of a general framework
for describing models.
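A tiny, hypothetical sketch of that combination (nothing here is from Infer.NET or the book; the network is random and untrained, it only shows where a neural net can sit inside a graphical model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graphical model: latent z ~ N(0, I), observed x ~ N(f_w(z), sigma^2 I).
# The z -> x structure is the graphical model; the conditional p(x | z) has its
# mean computed by a small neural network f_w.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)
SIGMA = 0.1

def f_w(z):
    h = np.tanh(z @ W1 + b1)   # hidden layer
    return h @ W2 + b2         # mean of p(x | z)

def sample_joint(n):
    z = rng.normal(size=(n, 2))                    # sample the parent node p(z)
    x = f_w(z) + SIGMA * rng.normal(size=(n, 3))   # sample the child p(x | z)
    return z, x

z, x = sample_joint(5)
print(x.shape)  # (5, 3): draws from the joint p(z) p(x | z)
```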
So, sorry, what was the second part of the question? Yeah, okay, so.>>[inaudible].>>I mean, just for lack
of time I’ve said very little about
Approximate Inference. Again, that model based
machine learning book guides you through some of
the inference methods that we’re using in that context. And again in real world
applications you make approximations and
those approximations matter. You might have a complicated
multi-modal distribution, and you might approximate it
by a Gaussian, which is uni-modal, so you’re losing
some sort of uncertainty, some ambiguity there, and that may or may
not be important.
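A small, illustrative example of exactly that loss of ambiguity (not taken from the talk; the numbers are arbitrary): approximating a bimodal distribution by a single Gaussian through moment matching keeps the mean and variance but silently discards the which-mode uncertainty.

```python
import numpy as np

rng = np.random.default_rng(1)

# A bimodal "posterior": an equal mixture of two well-separated Gaussians.
samples = np.concatenate([
    rng.normal(-3.0, 0.5, size=5000),   # mode 1
    rng.normal(+3.0, 0.5, size=5000),   # mode 2
])

# Moment-matched single-Gaussian approximation.
mu, sigma = samples.mean(), samples.std()
print(f"moment-matched Gaussian: mean={mu:.2f}, std={sigma:.2f}")
# Roughly mean 0.0 and std 3.0: a unimodal summary centred where the true
# distribution has almost no mass, so the two-mode ambiguity is lost.
```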
So part of the challenge, when you don’t get the results you need, is diagnosing where
the problem has gone wrong. Making bad assumptions, inappropriate assumptions,
is just one of the places. If somebody hands
you rubbish data that isn’t what it claims to be, then you can just
get bad results even if your assumptions
are correct. And the same thing with
the Inference Algorithm, that’s a whole very complex
world in its own right. And in essence
the goal of Infer.NET is to hide that from you. You can focus, as
the domain expert, on your prior knowledge, the things you know because you’re an expert in Medical Imaging or because
you’re an oncologist, or whatever; you don’t have to know anything
about inference. And the ultimate aim of Infer.NET is that the inference
will be entirely automatic and we’re not there yet, but we’ve made progress.>>Okay. I think we should
wrap up for now because we all need that kind of
[inaudible] time. I’ll request [inaudible] to say
a vote of thanks for Chris, yeah. Thank you very much.
Thank you Chris.

5 thoughts on “Keynote Talk: Model Based Machine Learning”

  1. You say that the model of a person restricts the degrees of freedom, so that less data is needed to conclude that any given data is a person – fine. But then you say that a convolutional layer represents a model of a person, when it clearly is not. Rather, the convolutional layer has parameters which iteratively change according to an algorithm to converge towards a structured representation of a person. Clearly the product is the model we're after, and the convolutional layer is more like a 'meta model'. It seems reasonable that the more degrees of freedom this meta model allows, the more kinds of models it can derive, right?

  2. Part of the confusion about the applicability of PGMs was due to the fact that, after November 2014, there was no release of Infer.NET. Is there any work happening, or is it getting integrated into CNTK?

  3. At 9:08 he compares a simple problem (electrical current prediction given a certain voltage) with a much more complicated problem of image recognition. IMHO, to differentiate between computationally and statistically big data, the targeted problem should be the same. So for instance, in the plane recognition problem, thousands of pictures of the same type of aeroplane are computationally large but statistically insufficient. However, maybe a reduced number of pictures of various models would suffice statistically.
