>>It’s my absolute pleasure

to introduce Chris Bishop. Chris Bishop is

a Microsoft Technical Fellow. He also is the

Managing Director of our Cambridge Research Lab. He’s also Professor

of Computer Science at the University of Edinburgh and a Fellow of Darwin College, Cambridge. In 2014, he was elected Fellow of the Royal

Academy of Engineering. And in 2007, he was elected Fellow of the Royal

Society of Edinburgh. And in 2017, he was elected as a Fellow

of the Royal Society. It is a long list of achievements and accolades that I could go on talking about, but rather than run late, let Chris’s talk do the talking. I’m sure he won’t disappoint. So, without much further ado,

Chris, all yours.>>Thank you very much. Thanks for the

invitation to come here. It’s a great privilege

to be the final speaker. I thought what I’d do for this talk, rather than talk about any particular application or any particular algorithm, is to step right up to 50,000 feet and think about machine learning and what it is all about. What are we trying to achieve? And in particular, to

give you a perspective on machine learning that I call Model-Based Machine Learning, which you can think

of as a compass to guide you through

this very complex world. So, Machine Learning can

be very intimidating. There are many, many algorithms. Here are a few. Every year, hundreds more

are published. You’ve heard about lots today. And especially, if you’re

a newcomer to the field, it’s bewildering,

it’s intimidating. Which ones do you need

to learn about? Which ones should you

use for your application? It can be really challenging. So, Model-Based

Machine Learning is just a perspective that I hope will help guide you on your journey through machine learning, whether you’re working on new algorithms or, in particular, whether you’re working on

real-world applications. So, coming back to all

of these algorithms, you might be a little

frustrated and say, “Why do I have to learn about hundreds or thousands

of different algorithms? Why can’t these machine

learning people just come up with the one

universal algorithm?” In fact, maybe they have; maybe it’s deep neural networks. “Deep neural networks will solve all of humanity’s problems; I don’t need to learn about the rest.” Well, there’s a mathematical theorem, and it’s proven, so it’s unlikely to be retracted anytime soon. It’s called the ‘No Free Lunch’ Theorem, and it’s by David Wolpert back in 1996. It says that, averaged over all possible data-generating distributions (you can think of that as averaged over all the possible problems you could ever want to solve), every classification algorithm has the same error rate when classifying previously unobserved points. That means if an algorithm is particularly good

at one problem, it will be particularly

bad at some other problem. To put it another way,

there is no such thing, as a universal machine

learning algorithm. That is not my personal opinion. It’s a mathematical theorem. Well, to put it

another way, the goal of machine learning is not to find the universal algorithm

because it doesn’t exist. But instead, to find

an algorithm that is in some sense well matched to the particular problem that

you’re trying to solve. Okay, so this is very fundamental, very much at the heart of machine learning. So, machine learning, we

all know, depends on data. But we cannot learn

just from data. We need to combine data with

something else with a model or we can think of this as constraints or we can think

of it as prior knowledge. I’ll use the term

prior knowledge, but people call it lots

of different things. You cannot learn

from data alone. Otherwise, we’d have a sort of universal algorithm. So, we need to combine data with this prior knowledge in order to make any progress. Now, we also know, and certainly from the recent developments in deep learning, that the more data

you have, the better. And in some sense, if you have lots of data, you can get away with

a little bit of prior knowledge. Or conversely, if you’re in a world where you have

very limited data, then you need to complement that with a lot of

prior knowledge, very strong assumptions about the problem you’re

trying to solve. Now, what’s interesting is the meaning of

this vertical axis. What do we mean

by a lot of data? So, this is a really

important point. I want to talk about big data and what we mean by the size of a data set because there are two completely

different meanings to the size of the data set. It is very important not

to get them confused. There’s the computational size, which is just how many bytes

does it take up on disk. And there’s

the statistical size, which relates to

its information content. So, we illustrate this with a couple of,

sort of corner cases. So, the first example, imagine we have

a block of metal. We apply a voltage and the current flows through

the block of metal, and we’re going to

measure how much current flows when we apply

a particular voltage. And we’ve got

seven measurements here. As we’ve applied

seven different voltages, we’ve measured

the corresponding values of current and our goal

is to generalize. This is a machine

learning problem, and so, our goal is to predict the

current for some new value of voltage on which we haven’t

yet made a measurement. Now, in this case, some kind and friendly physicist has come along and told us about Ohm’s Law. And Ohm’s Law just says current is proportional to voltage. It’s a straight line

through the origin. The only thing we have

to learn is the slope. The data points I have shown

have measurement errors. These are real-world

measurements. They’re a little bit noisy. If they weren’t

noisy, one data point will determine

the slope exactly. But the data points

are noisy and there’s only a finite

number of them. And so, we don’t know

the slope exactly. But if we’ve got

seven measurements, and the noise is not too high, we can be pretty confident

about that slope. There’s not

very much uncertainty. This is a data set,

which is computationally small because it’s seven pairs

of floating point numbers. So, computationally,

it’s a tiny data set, but statistically, it’s a very large data set. In other words, if I gave you another million measurements of current and voltage, then your uncertainty on the slope would get a little bit smaller, but it’s already very small. So, the next billion data points are not going to make a lot of difference. You’re already in the large data regime from a statistical point of view.
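To make that concrete, here is a minimal sketch in Python, assuming a hypothetical true slope and Gaussian measurement noise; none of these numbers come from the talk itself:

```python
import numpy as np

rng = np.random.default_rng(0)
true_slope = 2.0          # hypothetical conductance in Ohm's law, I = slope * V
noise_std = 0.05          # assumed Gaussian measurement noise on the current

def slope_and_uncertainty(n_points):
    """Least-squares slope through the origin and its standard error."""
    voltage = rng.uniform(0.0, 10.0, size=n_points)
    current = true_slope * voltage + rng.normal(0.0, noise_std, size=n_points)
    slope = np.sum(voltage * current) / np.sum(voltage ** 2)
    # Standard error of the slope estimate under the Gaussian noise assumption
    std_err = noise_std / np.sqrt(np.sum(voltage ** 2))
    return slope, std_err

for n in (7, 1_000_000):
    slope, err = slope_and_uncertainty(n)
    print(f"n={n:>9}: slope = {slope:.4f} +/- {err:.6f}")
```

Even with only seven points the standard error on the slope is already tiny, and a million further measurements barely move it.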

Think about another corner case. Imagine we’re going to have some images. I’m going to label

the images according to the object that they contain. So, it might be

airplane, car, and so on. And these images

might have millions of pixels that are occupying many megabytes

each on your disk. And we might have

a billion images of each class, a billion examples of airplanes and a billion examples of bicycles. So, this is going to take up

a huge amount of disk space. So, this is a data set which is computationally very large, big data in the usual sense. But what about statistically? Well, let’s imagine I’m naive and I just treat these images as vectors and feed them into my favorite neural network, whatever it might be, as a classifier. If you think about the airplane, the airplane could be

anywhere in the image, that’s two degrees of freedom. Actually, it can be

any distance as well. So, three degrees of

freedom of translation, three degrees of

freedom of rotation. Your planes come in different

colors, different shapes, different illuminations,

but all of these degrees of freedom can be taken together

combinatorically. So, if I showed you

one image a second, and you’ll all agree that

every image was an airplane, how long before I

run out of images? Well the answer is, far longer than the age

of the universe. I mean the number of images

that we all agree are airplanes is vast compared to the number of

electrons in the universe. So, if you just have a very naive approach to

classifying these objects, then even a billion images of each class is a

tiny, tiny data set. So, it is a computationally large data set that is statistically small. I’ll just go back for a second to the previous picture. This vertical axis refers not to the computational size of the data, but to the statistical size. So, that’s the concept of

prior knowledge in the data and the concept of

the size of the data set. So, coming back to

this problem then of which algorithm

am I going to use? How am I going to address

this problem of just thousands, of thousands of

different algorithms? So, I want to introduce

you to the philosophy, if you like, of

Model-Based Machine Learning. But it’s a very

practical philosophy. So, the idea of this is not to have to learn

every algorithm there is. Nor is it to try out every algorithm and empirically

see which works best. The dream of Model-Based

Machine Learning is instead to derive the appropriate machine learning algorithm

for your problem. Essentially, by making

this prior knowledge explicit, which I will show you how

that works in a minute. So, traditionally we say, “How do I map my problem onto one of the standard algorithms?” And often, that’s not

clear and so, typically, people will try out lots

of different things; they try decision trees, and neural nets, and support vector machines, and so on. Instead, in the

model-based view, we say, “What is the model that represents my problem? What is the model that captures my prior knowledge?” And so, by forcing

ourselves to make these prior

assumptions explicit, we have a compass to guide us to the correct algorithm or

these sets of algorithms. So, the idea is

the Machine Learning Algorithm is no longer

the first class citizen. Instead, it’s the model. It’s the set of assumptions, and those assumptions are specific to the problem you’re trying to solve. So, for your problem, you’ll have one set of assumptions; for a different problem, you’ll have a different set of assumptions, and you will arrive at different algorithms. And that’s why there’s no such thing as a universal algorithm. The algorithm is tuned to the particular problem

we’re trying to solve and that’s reflected

in this domain knowledge, these assumptions, this prior knowledge. So, we take the model, the prior knowledge, and we combine it with an inference method. The inference methods tend to be fairly generic. So, the inference methods are things like gradient descent, if it’s a neural net, or expectation propagation if we’re looking at graphical models. General techniques for

optimizing or computing the posterior distribution

of parameters of a model and together they define the machine

learning algorithm. So, the dream is, you write down explicitly

your assumptions. You choose an appropriate

inference method and then you derive

the machine learning algorithm. And when you apply it to your problem, it will be wildly successful. So, that’s the dream. Now,

we’re not entirely there yet. But I’ll show you

some great examples. Let’s talk a little bit about the assumptions

that go into models. If you look at a deep neural net, you might think, well, they’re not making any assumptions; they’re just generic, universal machine learning algorithms: you pour data in one end, and the magic comes out the other. So, where are the assumptions in the neural net? So, let’s look at, if you like, the simplest neural

net algorithm; I suppose it’s

logistic regression. This is making a very,

very strong assumption. This is a lot of

prior knowledge. On that prior knowledge axis, it’s sort of high, because it’s restricting us to a very, very narrow domain. It’s making very strong assumptions. It’s saying that the prediction Y is some linear combination of the inputs passed through some simple nonlinearity. That’s a very, very

specific model, and if we have multiple outputs

at the same time, then we arrive at

a single-layer neural net. That’s like lots of logistic regressions happening all at the same time.
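Just to make that assumption explicit in code, here is the entire model; the input, weights, and bias below are made-up numbers, purely for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical input features, learned weights, and bias.
x = np.array([0.2, -1.3, 0.7])
w = np.array([0.5, 0.1, -0.4])
b = 0.05

y = sigmoid(w @ x + b)   # P(class = 1 | x) under this very specific model
print(y)
```

Everything the model can express is a weighted sum squashed through a simple nonlinearity; that restriction is the prior knowledge.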

And that, of course, was the type of model that people were

excited about in the first wave of neural nets in the days of the

Perceptron and so on. The second wave of excitement of neural nets in

the late 1980s, early 1990s, was when back-propagation

came along, we learned to train

two-layer nets, in which these features themselves could be

learned from data, and it was a very exciting time. I actually made

a crazy decision, which was to abandon what was

a very successful career in physics because I’d read Geoff Hinton’s paper on

backprop and I thought, “Wow, machines that can learn

artificial intelligence. This is the future.”

And I gave up my career. I persuaded my boss to buy me a computer. I taught myself to program, I had never done that before, got some C code and started hacking away. So, that was the second phase of excitement

around neural nets. Then of course, they

went away again. They didn’t really go away,

they became rather niche. People moved on to other things. The support vector machines were very popular for quite a while. And then along came deep learning where we

learned how to train many, many layers, that’s

deep learning. By the way, the story I heard from Geoff, and I don’t think he’d mind me telling you this, is that he was very fed up because he had discovered back-propagation with colleagues, and they had been quite successful, but then they were overshadowed by support vector machines, which is kind of a funny sort of approach to machine learning in a way. So, when he finally got

neural nets to work properly, he decided to call them

deep learning because that allowed him to call

support vector machines shallow. That’s the real reason. So, what prior knowledge is

built into this? Well, again, there’s

a lot of prior knowledge. It says the output

is determined by this hierarchy of processing. So, let’s take an example. Let’s imagine I’m going to take a photograph. I’m going to classify that image

as either happy or sad. Now, what does the computer see? The computer sees pixels. So, how does a deep neural network solve it? Well, in the deep neural net,

the first layer, what it’s doing is looking for things like contrast, dark regions next to light regions, and the next layer combines those local contrast detectors to detect edges, rows of pixels in the image where a dark region is separated from a light region; maybe the next layer looks for where edges end or where they change direction. So, it looks for

things like corners, and a little bit further up, the corners get combined

together to make shapes, things like faces, perhaps expressions on faces, objects that you

see in the image. Maybe the next layer up, it’s looking at the

relationships between objects. Maybe there’s a birthday cake, maybe there are candles,

maybe there are people, maybe the people have

smiles on their faces, maybe at this point,

you’ve got a lot of evidence that this

is a happy image. Our brains are like that too. They have these layers of processing. They have centre-surround responses, oriented edge detectors, and so on. And when we train

artificial systems, we find similar structures

in the layers of visual processing that

we find in the brain. So, there’s one very strong piece of prior

knowledge built-in, which is this

hierarchical processing, and that seems to be very effective. So, what’s really going on, the reason deep learning is working so effectively, one way of saying this, is that there are lots of problems in the world, including the image processing example I just gave you, where this hierarchical structure seems to work well on real applications. Or, to put it another way, the prior knowledge that’s built into these deep networks resonates well with the kinds

of problems we’re trying to solve using these networks. So, that says a little bit about the data and prior knowledge; now let’s look at some of the other assumptions that are built into neural nets. So, let’s imagine now that

I’ve got a set of images. My goal is to classify the images according

to whether they contain a person or they don’t. So, here’s an image, and this image does

contain a person, and what we know is

that that classification does not depend on where in the image the person is located. So, these are all examples of images that contain a person. Now, in terms of

the vector of pixels, they’re all very

different, but they all belong to this class. If I want to build

a system that can detect a person irrespective of where the person

is in the image, then one way to do it

is to go and collect huge numbers of images with people in

all possible locations, and then the system will learn that they’re all examples of people. The challenge there, of course, is that it’s a bit like the airplane example, that very high dimensional space. I need many, many examples, lots of examples of images, just to capture this notion that the classification doesn’t depend on location. So, it’s very wasteful of data. Another way of doing it is

to generate synthetic data. So, maybe I don’t have data of people in lots of locations, but maybe I’ve

got just one image of a person in one location. I can create

synthetic data in which the person is moved around into different positions.
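Here is a minimal sketch of that kind of synthetic-data augmentation, using a toy image and hypothetical shift amounts:

```python
import numpy as np

def translate(image, dx, dy):
    """Shift an image by (dx, dy) pixels, filling the vacated border with zeros."""
    shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
    if dy > 0: shifted[:dy, :] = 0
    if dy < 0: shifted[dy:, :] = 0
    if dx > 0: shifted[:, :dx] = 0
    if dx < 0: shifted[:, dx:] = 0
    return shifted

# One original labelled image becomes many synthetic training examples.
image = np.zeros((64, 64)); image[20:40, 25:35] = 1.0   # toy "person"
augmented = [translate(image, dx, dy)
             for dx in (-8, 0, 8) for dy in (-8, 0, 8)]
print(len(augmented), "training examples from one image")
```

Each shifted copy carries the same label, so the dataset gets computationally larger while its statistical content barely grows.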

So, that’s another way of building prior knowledge: not building it into the model, but effectively

augmenting the data, and that’s quite commonly used. Again, that’s quite wasteful, because I have to replicate the data, and we end up with a computationally large dataset. It would be much smarter if we could just give

it one example of a person and then in the model, bake into it the prior knowledge that the output doesn’t

depend upon location. We call that

translation invariance. The way we do that

in neural nets is through convolutional

neural networks. So, this is the input image, and we have

a convolutional layer. In the convolutional layer, each node looks at a small patch of the image. The node next to it looks at the next small patch, and the weights of the blue node and the red node are shared. So, they adapt during training such that they’re always in lockstep. So, whatever this blue node

learns to detect, the red node will detect

exactly the same thing but moved slightly because that’s

the convolutional layer. That’s followed by a sub-sampling layer. Again, this node looks at a small patch on

the convolutional layer, and it might do something

like take the max. So, imagine there’s something in this image which causes

the blue node to respond, and that causes

this node to respond. Now, we move it slightly. Now, instead the red node

will respond, but again, because we’re

doing something like a max, this will still respond. So, this now exhibits translation invariance. It responds even though the image just moved slightly.
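Here is a tiny one-dimensional sketch of that convolution-plus-pooling idea; the kernel and the toy signal are made up, and the point is only that the pooled output doesn’t change when the input shifts:

```python
import numpy as np

def conv1d(signal, kernel):
    """'Valid' convolution with a single shared kernel (the weight sharing)."""
    n = len(signal) - len(kernel) + 1
    return np.array([signal[i:i + len(kernel)] @ kernel for i in range(n)])

def max_pool(features, width=4):
    """Sub-sampling layer: take the max over non-overlapping windows."""
    n = len(features) // width
    return np.array([features[i * width:(i + 1) * width].max() for i in range(n)])

kernel = np.array([-1.0, 1.0])            # a tiny edge detector, shared everywhere
signal = np.zeros(16); signal[5:9] = 1.0  # toy input with one "object"
shifted = np.roll(signal, 1)              # the same object, moved one pixel

print(max_pool(conv1d(signal, kernel)))
print(max_pool(conv1d(shifted, kernel)))  # identical here: local translation invariance
```

The single shared kernel is the weight sharing between the blue and red nodes, and the max over a window is the sub-sampling layer.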

Now, what we do in practice is we repeat this many times; we alternate with

another convolutional layer, another sub-sampling layer,

and it’s sub-sampling because the resolution of

this is lower than that. Eventually, when we

get to the output, we have a few outputs, and we have

translation invariance. So, if we moved things around,

the output stays the same. This actually encodes a sort of more general kind of

translation invariance because imagine part of the input is translated and

not the other part. Again, the output

will be invariant. So, it’s exhibiting sort of

local translation invariance. Think of a rubber sheet deformation. Imagine I’ve got

that birthday party, and some of the people

moved around and the birthday cake stays where it is, it’s

still a happy scene. So, we’ve got sort of local as well as global

translation invariance. You can see this is not

a universal black box. This has got a lot of strong prior knowledge baked into the structure

of the network. And if you don’t have

those sorts of structures, good luck with classifying airplanes and

all the rest of it. You’re back in that

exponential space again.>>Okay. So, to summarize where we’ve got to so far, we’ve talked about the fact that there isn’t a universal machine learning algorithm. The goal is to find an algorithm that’s good on the particular dataset that we have. That depends upon combining the data with prior knowledge. And the dream is that by being explicit about the prior knowledge, and combining it with an inference algorithm, we’ll discover the machine learning algorithm, instead of having to read 50,000 papers and implement and compare them all. I want to introduce another concept now

in machine learning. So, machine learning as you know, and in particular the breakthrough of deep learning, has generated this tremendous hype and excitement around

artificial intelligence. Now, artificial intelligence, the aspiration goes back certainly to Alan Turing

seven decades ago. And the goal is to

produce machines that have all of the

cognitive capabilities of the human brain. It’s a great aspiration, and we’re a very long way

from achieving it. We’ve taken

a tiny step towards it with the recent developments

of machine learning. So does that mean that all this hype about

artificial intelligence, all the excitement and the billions of dollars

investment is all a waste of time because it’s all decades

and decades away? In my view, no. In my view, all of the excitement

around machine learning is totally justified, but not because we’re on the brink of artificial intelligence; we may be, we may not be. Maybe it’s centuries away, or maybe it’s next year, I have no idea, but there is

something happening, which is revolutionary.

It’s transformational. And it’s a transformation in the way we create software, and I’m not really talking about the development process. I’m talking about the fact that ever since Ada Lovelace programmed the Analytical Engine for Charles Babbage, she had to specify exactly what every brass gear wheel did, step by step. And software developers

do the same thing today. It’s a cottage industry

in which the developer tells the computer exactly what

to do step by step. Now, today, of course,

the developer doesn’t have to program

every transistor. They’ll call some API which invokes a million lines of code written by other developers, and there are compilers down to machine code and all the rest. So they’re very effective and very productive compared to Ada Lovelace in terms of their efficiency, their productivity. But, fundamentally,

we’re still telling the computer how to solve

the problems step by step. Now, machine learning, we’re doing something

radically different. Instead, we’re

programming the computer to learn from experience, and then we’re

training it with data. The software we write

is totally different. The software we write often

has a lot of commonalities. So we’d use neural nets to solve speech

recognition problems, communication

problems, and so on, adapted each time, according to the prior knowledge of

our domain, of course. But we’re doing something

radically different. I think this is a transformation in the nature of software which is every bit as profound as the development of photolithography. Photolithography was

a singular moment in the history of hardware. Ever since the days

of Charles Babbage and gear wheels, vacuum tubes, transistors, logic gates, computer hardware has been

getting faster and cheaper. And then we discovered

how to print large scale integrated circuits

using photolithography. And with that came a transformation, because it went exponential. That’s Moore’s Law. We learned to print circuits, and now the number of transistors on a circuit doubles every 18 months. And as they get smaller, they get faster. Amazing things happen. That’s why we have

the technology we have today. Why are we all carrying supercomputers around in our pockets? Because of photolithography,

because of Moore’s Law. Something interesting may be happening in software, because the way we’re creating these solutions is by programming the computer to learn from experience and then training it using data. And we see a Moore’s Law of data: the amount of data in the world is doubling every year or two, maybe. And so we are on

the brink of something tremendously exciting and all pervasive through

machine learning. That’s real, that’s

happening right now. One of the things

that it might lead to is artificial intelligence. But even if it

doesn’t, or even if artificial intelligence

is decades away, this is going to transform

every aspect of our lives. One of the areas that I’m

hoping it’ll transform is health care and that’s

a personal interest of mine, but it will be all pervasive. And I do think it’s

transformational. I’ve got the yin and yang

diagram because I think there’s a kind of flipside of

learning from data, which is quantifying

uncertainty. So, again, go back to

traditional computer science. It’s all about logic. It’s all about zeroes and

ones. Everything is binary. The engineers at Intel and ARM work really hard to make sure every transistor is unambiguously on or off. We’re in the world of

learning from data. We’re in the world

of uncertainty. We have to deal with ambiguity, so uncertainty is everywhere. Which movie does

the user want to watch? Which word did they write? What did they say? Which web page are

they trying to find? Which link will they click on? Which gesture are they making? What’s the prognosis for

this patient? And so on. In all cases, we never have a definitive answer; we are never certain which link the user is going to click on. But they may be

much more likely to click on one link than another, and we can compute

that likelihood using machine learning. Uncertainty is also at the heart of machine learning. So there’s a transformation from logic to thinking about uncertainty. Of course, you all know there’s a calculus of uncertainty, which is probability. Again, there are mathematical theorems which show that if

you’re a rational person and you quantify uncertainty, you will do so

using the rules of probability or something that’s mathematically

equivalent to them. So, again, that’s

a mathematical foundation that was laid a long time ago. That’s not going to change. Let’s think just very briefly about two perspectives on probability. What do we mean by probability? Well, when we’re in

school, we usually learn a little bit about probabilities. We learn the frequentist view, the limit of an infinite number of trials, a frequency interpretation of probability. But I’m sure many

of you know there’s a much broader

interpretation which is probability is a

quantification of uncertainty, and that’s the

Bayesian perspective. It’s almost unfortunate that

both are called probability, but the mathematical

discovery is that if you quantify uncertainty

using real numbers, those numbers behave

exactly the same way as the frequencies with

which dice throws behave. And so we called it probability. The fact that we use the same name for both, I think, has caused a lot of confusion over the years. Let me just give you

a little example. Hopefully, this will

shed some light on this. So imagine we’ve got a coin,

and the coin is bent. The coin is not equally likely to land on one side or the other. Well, imagine, if

I flip the coin, there is a 60

percent probability it will land concave side up, and a 40 percent probability it will land concave side down. Let’s just imagine

that’s the physics of this particular bent coin. What do we mean by

60 percent probability? We mean if we flip it

many times and compute the fraction of times that

lands concave side up, as we go to the limit of an infinite number of

trials, that fraction, which will be sort of a noisy thing, will settle down and asymptote to some number, and that number will be 0.6. That’s the frequentist view

of probabilities. Now, let’s suppose that one side of this coin is heads,

the other side is tails. But imagine you don’t

know which it is. All you know is that

the coin is bent, and there’s a 60

percent probability of landing concave side up. Okay. So, Victor, I’m going to make a big bet with you, a thousand dollars, about whether the next coin flip is going to be heads or tails. Now, you’re a very rational

and very intelligent person. How are you going to bet? You’re going to bet 50-50. It’s sort of obvious,

right? It’s symmetry. Victor doesn’t believe that if we repeat

the experiment many, many times, that half the time, it will be heads up

and half the time, it will be heads down. What he believes

is that it could either be 60 percent heads, or it will be 40 percent heads. You see, we are flipping

the same coin each time, but we don’t know which it is. So the frequency with which

it lands concave side up, it’s like a frequentist

probability, but uncertainty about whether the next coin flip

is going to be heads or tails is like

a Bayesian probability. And so imagine I’ve

got this bent coin behind the desk here, and I’m flipping the coin. And I’m honest and truthful, and I’m telling you whether

it’s heads or tails. The more data you collect, the more you can discover about whether heads is on the concave side or heads

is on the convex side. As you collect data, your uncertainty about whether the heads is on the concave or the convex side gradually reduces. And in the limit of an infinite number of trials, there’s no uncertainty left at all; you’re completely certain about whether the heads is on the concave side or the convex side. You still don’t know

whether the next coin flip is going to be heads or tails. But let’s say you’re certain that the heads is on the concave side; then you know there’s a 60 percent probability that the next flip will be heads. I hope that illustrates

the difference between Bayesian and

frequentist probabilities. That’s the simplest example I can think of.
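As a minimal simulation of that story (the 60 percent physics and which side carries the heads are assumptions baked into the sketch), the frequentist probability stays fixed while the Bayesian probability that heads is on the concave side sharpens as flips are observed:

```python
import numpy as np

rng = np.random.default_rng(1)
p_up = 0.6                    # frequentist probability: physics of the bent coin
heads_is_concave = True       # the hidden fact we are uncertain about

for n in (0, 10, 100, 10_000):
    up = rng.random(n) < p_up                 # did each flip land concave side up?
    heads = up if heads_is_concave else ~up   # what we actually observe
    k = int(heads.sum())
    # Log-likelihood of the observed sequence under each hypothesis (prior 50/50).
    log_l_concave = k * np.log(p_up) + (n - k) * np.log(1 - p_up)
    log_l_convex = k * np.log(1 - p_up) + (n - k) * np.log(p_up)
    posterior = 1.0 / (1.0 + np.exp(log_l_convex - log_l_concave))
    print(f"after {n:>6} flips: P(heads is on the concave side) = {posterior:.4f}")
```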

At this point, you might be thinking, why am I making so much fuss about this? Because I’ve said that

in traditional computing, everything is zero or one. And now everything is

going to be described by probabilities which lie

between zero and one, and it seems like a tiny change. It seems like

just a little tweak. So, this is an example, this is my illustration of

why it’s not a little tweak, why it’s a profound difference. So imagine, here’s a bus. And let’s suppose the bus

is longer than the car. And we’ll suppose that the car is longer than the bicycle. Okay. Now again, I know

you’re all smart people. So, if I say, the bus

is longer than the car and the car is

longer than the bicycle, do you all agree that the bus must be longer than the bicycle? Okay. If anybody doesn’t agree, go back to the beginning

of the class or something. That’s a very well

known property. We call it transitivity. And here’s the amazing thing. When we go to the world of

probabilities and uncertainty, transitivity need

no longer apply. And there’s a really

simple example of it. And it’s these things. These are called Efron dice, or nontransitive dice. And they’re standard dice except they have unusual choices of numbers. And let’s say, again, we’re determined to get some money out of Victor, so I’m going to make a bet that we’re going to have a game of dice. So, we’re going to roll the dice 11 times, an odd number, and whoever gets the greater number of wins is going to get the money. Well, it turns out

that the orange die will beat the red die, two-thirds of the time. So, two-thirds of the time, the orange number will be bigger than the red

number. Big deal. If I play the orange

against blue, two-thirds of the time, blue will give

a bigger number than orange. Two-thirds of the time, green will give

a bigger number than blue. And now, here’s the amazing thing. The bicycle is also longer than the bus, because two-thirds of the time, red will give a bigger number than green. Now, if that isn’t

counter-intuitive, I don’t know what is.

It’s bizarre, right? It’s extraordinary and it’s just a consequence of the fact that these are

uncertain numbers, they’re stochastic numbers. And the way it works is actually very simple. So, these are the numbers on the different dice. The orange one actually always rolls a three, as it happens. On the red one, two-thirds of the time you get a two, and one-third of the time you get a six. So, it’s obvious that two-thirds of the time, orange gives you a bigger number than red. And I’ll leave it as an exercise for you to check the others.
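If you would rather not check it by hand, here is a quick computation. The orange and red faces are the ones quoted above; the blue and green faces are assumed here to be a standard Efron-style set, so treat them as illustrative:

```python
# Faces of the four dice (orange and red as quoted in the talk; blue and green assumed).
dice = {
    "orange": [3, 3, 3, 3, 3, 3],
    "red":    [2, 2, 2, 2, 6, 6],
    "blue":   [0, 0, 4, 4, 4, 4],
    "green":  [1, 1, 1, 5, 5, 5],
}

def p_beats(a, b):
    """Probability that a roll of die a exceeds a roll of die b."""
    wins = sum(x > y for x in dice[a] for y in dice[b])
    return wins / 36

for a, b in [("orange", "red"), ("blue", "orange"),
             ("green", "blue"), ("red", "green")]:
    print(f"P({a} beats {b}) = {p_beats(a, b):.3f}")
```

Each pairwise probability comes out at two-thirds, yet the “beats” relation runs in a cycle rather than a ranking.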

So, occasionally, in my copious spare time, I sometimes go and give talks

in schools that sort of try and inspire the next generation with the excitement of

Machine Learning, Artificial Intelligence,

Computer Science. We actually hand out packs

of these dice to the kids. And if you go to that link, you can actually

read a little bit more about it and you can see those numbers and check

for yourself this is real. So again, I think this

is quite a profound shift from the world of logic

and determinism to, if you like, the real-world

of uncertainty. At this point, I was going

to show a demo and sadly, I can’t show you the demo. So, in fact, I’m just

going to skip over this. The demo was simply

an example of Machine Learning in

operation where the machine learns about my

preferences for movies. And it actually does

so in real time. So, as I rate movies as like or dislike, its uncertainty about which movies I like gradually reduces. So, what you’re seeing in the demo is really, if

you like, the modern view, what I like to call the modern

view of machine learning, not machine learning

as tuning up parameters by

some optimization process, but instead, machine

learning in the sense that the machine has

a model of the world. In this case,

a very simple world, the world of movies that

I like or don’t like, it has uncertainty about the world, expressed

as probabilities. And as it collects data, that uncertainty reduces,

because it’s learned something, rather like

the coin flip example. And we can think of all of machine learning from

that perspective. What I’m going to do now

is give you a tutorial in about one slide on a favorite subject of mine, Probabilistic Graphical models. Because I’m going

to show you how we’re taking steps towards realizing that dream of

model-based Machine Learning. Not just as a philosophy

of Machine Learning, not just as a compass to guide you through

this complex space, but even as a practical tool that we can use in

real-world applications. And to do this, I’m just

going to need to give you a very quick tutorial

on graphical models. If you know about

graphical models already, this will be very boring, and if you don’t know about graphical models already, you’re not going to learn very much, but at least you’ll get a sense of it. So imagine, I’ve got two boxes, one of them is green,

one of them is blue. And I’m going to pick one

of these boxes at random, but not necessarily with

a 50, 50 probability. It might be 60, 40 or something. And then we’re going

to describe that by a graphical notation. And in this graphical notation, I have a circle representing this uncertain quantity; it’s the variable jar. So, jar is a binary variable that’s either green or blue, but it’s not a regular variable. It’s not definitely green or definitely blue; it has a probability of being green or blue. It’s an uncertain variable. And this little box just

describes that probability. Now, imagine, that the boxes contain cookies,

biscuits, as we say. These biscuits are either

circular or triangular. And the proportion of biscuits

is different in each box. So, I can now say,

supposing I go to the green box, the green jar, and I pull out a cookie

without looking, then there’s a one-third

probability that it’ll be triangular and two-thirds

that it will be circular. If I go to the blue jar instead, there’s a one-third

probably it will be circular and two-thirds

it will be triangular. Okay? So again, there’s

some uncertainty. If I draw a cookie

out of the jar, we’re uncertain

about which it is. But we know something,

we know this probability. And so, cookie, again, is an uncertain variable that’s either

triangle or circle. It has some probability, but the value of

that probability, depends upon the value of

this random variable jar. So, we can think of

this model in what we call a generative way in which

I do an experiment, I, first of all, randomly, choose a jar, and

then given that jar, I dip in and I randomly

choose a cookie, and that tells me the value of jar and consequently

the value of cookie. That’s a forward model

and that generates data, generates jars, it

generates cookies. And I could repeat

that many times. Now, in real applications, this graph, of course, is describing my prior knowledge

about the world. I know the world

consists of jars and it consists of cookies and they relate to each other

in certain ways. So this graph is

a very visual way of expressing that prior knowledge which is obviously critical in as we’ve

seen in Machine Learning. Typically, what we do with these graphs, though, is we observe something; in this case, you might observe the cookie. And then we want to go the other way: we want to work out which jar that cookie came from. So, maybe there’s a 60 percent

chance that it’s green. So, it’s more

likely to be green. But now, when I

observe the cookie, I observe that

the cookie is triangular. Now, your intuition

says, if it’s triangular, it’s more likely that

it came from blue than green. And that’s correct. So when you run the math, you just apply Bayes’ theorem, and it’s very simple; you’ll find that the probability for the jar shifts a little bit towards blue, just as your intuition would expect.
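Here is that calculation written out, using the numbers from the example: a 60/40 prior over green and blue, and the cookie proportions described above:

```python
# Posterior over the jar given a triangular cookie.
prior = {"green": 0.6, "blue": 0.4}
p_triangle_given_jar = {"green": 1 / 3, "blue": 2 / 3}

evidence = sum(prior[j] * p_triangle_given_jar[j] for j in prior)
posterior = {j: prior[j] * p_triangle_given_jar[j] / evidence for j in prior}
print(posterior)   # the probability mass shifts from green towards blue
```

The 60 percent belief in green drops to roughly 43 percent once the triangular cookie is observed.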

And so that, if you like, is the Machine Learning process. We’ve observed that I like a particular movie, and the internal state of

the machine gets updated, using a sort of Bayes’ theorem on steroids, to say I’m a bit more likely to like action adventure than romantic comedy, or whatever it might be. And that’s the crash tutorial, but Chapter eight of

this amazing book, I’m sure you will have, I hope. Chapter eight is

a free PDF download and that’s a whole chapter

on graphical models. Okay. So, let me illustrate now. I’m going to pick a particular Machine

Learning algorithm, it’s called PCA or Principal

Components Analysis, something everybody learns

about in Machine Learning 101. And first of all, we’re

going to describe PCA the way you’d normally learn

about it from a textbook. And then, I’m going to

show you how to derive PCA using the

model-based perspective, and we’ll use

those graphical models. So, PCA as an algorithm is like a recipe. It’s a recipe that you apply to data. First of all, it says, take the data. So, the data will be vectors in some high dimensional space, and there are N of them. It says, first of all, average those vectors to compute the mean, then subtract the mean from all of those vectors and compute this thing, which is the sample covariance matrix; then find the eigenvalues and eigenvectors of the sample covariance matrix, and keep the eigenvectors corresponding to the M largest eigenvalues. That, in some sense, compresses the data, or projects it down onto an M dimensional subspace in a way that preserves variance.
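As code, the recipe really is just a few lines; this is a minimal sketch run on made-up data:

```python
import numpy as np

def pca(X, m):
    """The textbook PCA recipe: mean, covariance, top-m eigenvectors, projection."""
    mean = X.mean(axis=0)                      # average the vectors
    centered = X - mean                        # subtract the mean
    cov = centered.T @ centered / len(X)       # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigendecomposition (ascending order)
    components = eigvecs[:, ::-1][:, :m]       # keep the m largest eigenvalues' vectors
    return centered @ components, components, mean

X = np.random.default_rng(0).normal(size=(500, 10))   # toy high-dimensional data
Z, W, mu = pca(X, m=2)                                 # project down to 2 dimensions
print(Z.shape)   # (500, 2)
```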

So, that’s Principal Components as a recipe, and you can code that up, turn the handle, and out

would come the answer, but you’d have no idea why you picked that recipe. Maybe it works brilliantly; but what if it doesn’t work well? What are you going to do now? How are you going to change the recipe so that

it works better? So, if you have no compass, you’re just left with

random trial and error. So, here’s a much better way

of thinking about things. So this is PCA

viewed as a model. So, in the same way that

we’re going to pick a jar, and then choose

a cookie from the jar, I’m going to describe to you

how to generate the data, because one way of capturing your prior knowledge is to write down how the data gets generated. So, in this case, it says: pick a vector from a lower dimensional subspace, from a Gaussian distribution having zero mean and unit variance, a circular Gaussian distribution. So, pick a vector from that Gaussian distribution, then project it into this high dimensional space, the space of your data, with some linear transformation, and then finally, generate a data point by taking that projected point, making it the center of another Gaussian distribution that represents the noise, and picking a sample from that. And so, don’t worry

about the details. It’s just a description of how to choose one of the jars and reach in and

then pick a cookie. So choose the low

dimensional vector and then generate the high dimensional vector

by adding noise. Another little notation,

it’s called a Plate. It says just repeat

that process n times. So, it says, put the cookie back in the jar, give it a good shake, close your eyes again, pick a jar, pick a cookie from the jar; do it n times.
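That generative story also fits in a few lines; the latent dimension, the linear map, and the noise level below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n = 2, 10, 500            # latent dimension, data dimension, number of points

W = rng.normal(size=(d, m))     # the (unknown) linear map from latent space to data space
mu = rng.normal(size=d)         # the (unknown) mean of the data
sigma = 0.1                     # the (unknown) observation noise level

# The generative process, repeated n times (the "plate"):
Z = rng.normal(size=(n, m))                          # latent vector: zero-mean unit Gaussian
X = Z @ W.T + mu + sigma * rng.normal(size=(n, d))   # project up and add Gaussian noise
print(X.shape)   # (500, 10): data we could now hand to the inverse (inference) problem
```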

>>So, that generative process, which describes how the data gets generated, is a great way to express

our prior knowledge. But when we do machine learning, we’re trying to solve

an inverse problem. We have to go back the other

way, which is much harder. So, we observe the data, and we have to make inferences about the points in

the lower dimensional space, and also the values

of the parameters of this linear transformation. And so, we have

to run inference. And then again, there’s a mathematical proof that this is identical. If you use what’s called maximum likelihood to do the inference, that is to say, if you choose all the parameters to maximize the probability of the data under the model, you exactly recover PCA. Now, at this point, you might think, ‘ah, it’s a lot of work just

to get back to PCA.’ The two are completely equivalent, so why is the model-based view so much better? The reason is that if this doesn’t do what

you wanted to do, you can go back and

examine those assumptions. And you can change

the assumptions to better reflect the problem

you’re trying to solve, and then, rederive the model. You haven’t just got a recipe, you’ve got a procedure for arriving at the best model

for your problem. So, just take a simple example: suppose that these draws, these generated data points, are not generated independently. So, for example, let’s imagine I’m air traffic control, and I want to know

where the aeroplane is. The aeroplane is

flying across the sky. And once a second, my radar is going to send out some energy. It’s going to bounce

off the aeroplane, come back and I receive it, and I make a measurement

of where the aeroplane is. Now, the problem is that

that measurement is noisy. So, if I just make

a single measurement, I’ll know where

the aeroplane is roughly, but there’ll be

some uncertainty. Now, we know that if

that’s just random noise, if I make multiple measurements, I can sort of average

out the noise, and get a more certain estimate of where that aeroplane is. So, we’re going to make

several measurements. The problem is,

the aeroplane is moving. As I make these

measurements, it’s moving. If I just average

the measurements, that will be great, because

I’ll average out the noise. But I’ll also average

out the location, which is what I’m trying to

find. So that’s bad news. If I don’t average, if I just

use the latest measurement, I won’t be averaging

over the motion, but I have a lot of noise.

So what should I do? Well, you could sort of have some intuition. You could say, “Hmm, I should take the latest measurement because that’s where the aeroplane is, but I’ll add in a bit of the previous measurement to get rid of some of the noise, and maybe a little bit of the measurement before that. But the measurement from 10 minutes ago is irrelevant.” So, I’ll have some sort of weighted average, where I give more weight to the more recent measurements. That’s sort of your intuition. Actually, that intuition

turns out to be good. That’s actually

what you should do. But how much weight

should you give? What sort of functions should

you use for this decay? How much should you decay by? How do you know what

to do? Your back in the world of recipes, intuition, trial, and error. So instead of that, let’s build a model, in

which we are very explicit, about all the assumptions

we’re going to make, because that’s more

likely to work better. And if it doesn’t, we know how to change things to improve it. So, we’re going to say that this is the actual position of the aeroplane in space. It’s the thing we want to know. We don’t know it; it’s unknown. So, the aeroplane is in some position. And then we make a measurement, and the measurement is noisy. So this is the noise process, but we do know this value; that’s the thing we observe. This is the observed position, which is a noisy measurement

of the true position. Given that alone, we could estimate this but

have a lot of uncertainty. What’s going to happen

now is the aeroplane is moving across the sky. We could build a model for that. And the simplest model

that we can have is to assume that the uncertainty in the position of

the aeroplane is Gaussian, that the measurement

noise is Gaussian, and that the movement

of the aeroplane across the sky is described

by a linear model. So, given its position

and its velocity, we can compute where it will

be at the next timestep. Now, again, we make

another measurement, another noisy measurement

of that next timestep. Now, the aeroplane moves a little bit further and we make another measurements and so on. So, that’s the

generative process. But now, what we need to

do is to run inference. Given these observations,

we need to compute. We need to revise

the probabilities of these aeroplane locations. So, we can do that with, sort of, Bayes’ theorem; it’s a more complicated version of Bayes’ theorem. And it turns out that

that problem can be solved in a very elegant way computationally by passing

messages around the graph. So, we don’t have

time to go into that. It’s a very beautiful mathematical solution called message passing. It’s very generic. But this thing turns out to have a name. It’s called the Kalman filter. It’s been around since the 50s or whatever. It’s very standard stuff in electrical engineering. When I was writing

my 2006 textbook, I had a chapter on these time series models, and I read several books called Kalman Filters, Introduction to Kalman Filters. I found them pretty impenetrable, and it is very complicated: many, many chapters before you finally get to all of this stuff. This is, by far, the simplest way of deriving the Kalman filter that I know: just derive message passing in its generality, and apply it to this linear Gaussian model. And you get the Kalman

filter equations, in which you say the posterior probability of this, the position of the aeroplane at this time, depends upon all of the measurements, but is more sensitive to the current and recent measurements. And so, you do get that decay in the weighting of the evidence, but in a very precise way that you derive from the mathematics.
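Here is a minimal sketch of those equations for a one-dimensional aeroplane tracked with a position-and-velocity state; the motion model, noise levels, and number of measurements are all made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
dt, q, r = 1.0, 0.01, 4.0        # timestep, assumed process noise, assumed measurement noise

F = np.array([[1.0, dt], [0.0, 1.0]])   # linear motion model: position += velocity * dt
H = np.array([[1.0, 0.0]])              # we only observe the (noisy) position
Q = q * np.eye(2)
R = np.array([[r]])

x = np.array([0.0, 1.0])        # state estimate: [position, velocity]
P = np.eye(2) * 100.0           # large initial uncertainty

true_pos = np.cumsum(np.full(20, dt))             # aeroplane moving at unit speed
for z in true_pos + rng.normal(0, np.sqrt(r), 20):
    # Predict: move the estimate forward with the motion model, uncertainty grows.
    x, P = F @ x, F @ P @ F.T + Q
    # Update: weight the new measurement against the prediction (the Kalman gain).
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x = x + K @ (np.array([z]) - H @ x)
    P = (np.eye(2) - K @ H) @ P

print("estimated position/velocity:", np.round(x, 2))
```

The Kalman gain K is exactly the “how much weight should I give this measurement” question, answered by the mathematics rather than by intuition.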

And you can even pass messages in the other direction and send information back in time, and get a better estimate

of where the aeroplane was, by making use of future measurements. Again, just as your intuition would indicate. And guess what? If these are not Gaussian, but instead are discrete variables, again, you just pass messages back and forth. Now, it’s called the

hidden Markov model. Well, that’s a completely

different literature with completely different

notation and completely different but

equally impenetrable derivations of how all this goes. Again, it’s just

exactly the same model, just slightly

different assumptions. And maybe this works quite well, and maybe it doesn’t work quite well enough. So, you try this out on your problem, you find it’s still not working quite well enough, and you know what to do. It could be, maybe, that there’s a problem with the data

that you’ve collected, maybe there’s a problem

with the inference, because most inference

algorithms are approximate. For the Kalman filter,

it’s exact. But once you get to more complex models, you almost always have to use approximate inference. And maybe your inference

algorithm had some issues, or maybe your prior assumptions

were not correct. Maybe you need to refine them for the problem you’re

trying to solve. You know how to do that because you made them explicit. So, maybe this noise isn’t Gaussian in a real radar. So you go and talk to a radar engineer, find out what the noise really is like, and then model it. And you get better results. Okay. So, I think it’s more

or less my final slide. What I’ve shown you so far

is really a philosophy, a viewpoint of machine learning

that I hope helps provide you with a compass

to guide you through this complex morass

of algorithms, but also a practical tool to use when you’re building

real world applications. But at the back of

our minds, we have a dream. And the dream is that we

can somehow automate this. We can provide tools for people who haven’t read all the textbooks on neural nets, I mean, machine learning and so on. You need to buy the textbooks, by the way. You don’t need to read them, just so everybody is clear about that; or one in particular, anyway. But instead of having to read all that stuff and learn all about this, can we automate it? Can we provide

tools that will help democratize this approach

to machine learning? And so, this is the dream. So, if you think about coding up inference for

a complicated problem, like the movie

recommender example, it’s pretty complicated stuff, thousands of lines of code. It’s written by machine learning experts, who know about the modeling, know about inference, know how to code up the inference in the context of those models. This is all complicated stuff. All written in C++ or whatever

your favorite language is, compiled down to machine code, combined with the data, lots of compute happens and you get your predictions

with uncertainty. What if instead, we could write a thing which we call

a probabilistic program. So, a probabilistic program is just a very short piece of code written in

some appropriate language, which effectively describes what that probabilistic model, that graphical model describes. So, it will almost say pick one of the jars

with this probability, and then, for that jar, pick a cookie with

a certain probability; or the aeroplane is in

this position in the sky, and one second later, it’s

moved to a new position, and I’m going to

make measurements. The measurements have

Gaussian noise or something. It’s just a simple description in a few lines of code, maybe, if we’re lucky, tens of lines of code, that describes the generative process of the data, or describes in a very clear, intuitive form the prior knowledge that we’re baking into our model.
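To give a flavour of what such a program looks like, here is the jar-and-cookie story written as an ordinary Python generative function; this is purely illustrative and is not Infer.NET syntax:

```python
import random

def generate_one_observation():
    """A toy generative story in the spirit of a probabilistic program."""
    jar = "green" if random.random() < 0.6 else "blue"                 # pick a jar
    p_triangle = 1 / 3 if jar == "green" else 2 / 3
    cookie = "triangle" if random.random() < p_triangle else "circle"  # pick a cookie
    return jar, cookie

data = [generate_one_observation() for _ in range(10)]   # the plate: repeat n times
print(data)
```

The whole model is the few lines inside the function; the hard part, inverting it to do inference, is what the compiler is meant to generate for you.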

And then we have a piece of magic, which is a probabilistic

program compiler, which is going to take

this high level description and generate those thousands of lines of code automatically. So that’s the dream. We

haven’t achieved the dream, but we have made

a lot of progress. We’ve built a compiler. And if you go to infernet, you can download infer.net. And there’s lots of tutorials

and examples and so on. And infer.net doesn’t

cover every possible case, but it covers a lot

of common cases. And for those cases

to which it is applicable, you do have this automation. And of course,

the whole time, we are looking to extend it

and generalize it. So, it is quite an exciting program of research. And so, we’re going to

leave you with this. This is the graphical model with the random variables,

the probabilities, and the plates for

the movie recommender problem, the problem of recommending movies. So here, we have users. So, I’ll stand back so I can read the writing. Okay. User bias, feature weights. So, what we’ve got here are

features about the user. It might be age, gender, geographic location,

anything which might influence what movies they like. Here, we’ve got

features of the items. So it might be

the duration of the movie, the actors, whatever,

perhaps genre, action, adventure,

romantic, comedy and so on. And then, we also have

in here information, which is what we call

collaborative filtering. So that’s: people who’ve liked the movies you’ve liked so far also liked these other movies, so perhaps you’ll like them. But it’s not coded up as some sort of hacky piece of intuition; it’s just described by a probabilistic model, a very precise probabilistic model. And so, this can be cast in a few dozen lines

of infer.net code. And then, the

inference algorithm can be compiled automatically. And so, right down

here, we have the thing we observe. That’s the ratings. That’s somebody saying I

like this movie or I don’t or this movie has five stars,

this is one star. Once we make observations from a user about which movies they like, we send information, we pass messages up this graph, revise probabilities for these hidden variables, send messages back down again, and we get revised probabilities, which are ratings for movies the person hasn’t yet seen. And so we update the probability

I’m going to like some unseen movie based on ratings I’ve given

to movies I have seen, plus all the ratings

that thousands of other people have given to

that movie and other movies. That’s how that works. And again, that’s all coded up in infer.net.
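Sketching the shape of such a model in plain Python (every quantity here, the feature vectors, latent traits, weights, and noise, is a hypothetical stand-in for the real Infer.NET model):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_traits = 100, 50, 4

# Hidden variables the model is uncertain about (here simply sampled from priors).
user_w = rng.normal(size=3)                            # weights on user features
item_w = rng.normal(size=2)                            # weights on item features
user_traits = rng.normal(size=(n_users, n_traits))     # collaborative-filtering part
item_traits = rng.normal(size=(n_movies, n_traits))

user_feats = rng.normal(size=(n_users, 3))             # e.g. age, gender, location
item_feats = rng.normal(size=(n_movies, 2))            # e.g. duration, genre

# Generative model for the thing we actually observe: the like / don't-like ratings.
affinity = (user_feats @ user_w)[:, None] + (item_feats @ item_w)[None, :] \
           + user_traits @ item_traits.T
ratings = (affinity + rng.normal(0, 0.5, size=affinity.shape)) > 0
print(ratings.shape)
```

Inference would run in the opposite direction: from the observed ratings back to distributions over all of the hidden variables.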

And so we leave you with another book, but the good news is this book

is online, it’s free, and will be forevermore. It’s called Model-Based

Machine Learning. It’s co-authored with John Winn and John has actually done overwhelmingly the bulk

of the work on this book. So I’m very much

the second author. It’s really John’s baby. This is a very unusual book. There’s a little introductory

chapter, but thereafter, every chapter is

a real world case study. We’ve chosen examples from Microsoft because that’s

what we know about, and these are things

we’ve worked on. And in each case, we

start with the problem. The problem we’re

trying to solve, we’re trying to match different players on Xbox

so they’ll have a good game, in other words, they’ll be

similarly matched in strength. That’s the problem

we’re trying to solve. We’ll describe

the data that we have, we’ll describe

the prior knowledge, the assumptions

we’re going to make. We derive the Machine

Learning algorithm, we test it out on the data, we find it doesn’t

work very well, because that is what

happens in practice. Anybody who’s ever tried Machine Learning

in the real world, the first thing you try

generally doesn’t work. And then we go

back to debugging, was there a problem with

the data that we collected? Was there a problem

with the inference, the approximate inference

algorithms we used? Or was it a problem with

the assumptions that we made? And so we go and we revise

the assumptions and then run it again and of course every chapter has a happy ending. We get good results and it ships and is used by

millions of people. But it’s a little bit more honest about

the process by which we arrived at those solutions

and it shows you how, I hope, for each

of these examples, and they’re drawn from

very different domains, medical examples and so on, in each case, hopefully you can see how by making

the assumptions explicit, that critical prior knowledge, by making it explicit, it gives you a compass to

guide you through the process of revising and refining the solution and getting

it to work properly. Otherwise you’re left

with a big space of trial and error and not

knowing what to try next. So with that, thank

you very much.>>Thank you Chris

for that great talk. And we’ll probably take

a couple of questions and then, yeah, you have the mics? Okay, so this hand

going up first maybe one mic for the gentleman here.>>Hello sir, I’m [inaudible].

Thank you for the very nice talk. So I have one question: by restricting to the class

of probabilistic models, are we losing something? What is your thought on

that? Because, there are neural nets which are

not probabilistic models.>>Yeah. Several

thoughts on that. First of all,

the probabilistic view of Machine Learning

is a general one. So the quantification

of uncertainty using probabilities is the only rational way to deal

with uncertainty. In practice, we often can’t deal with

probabilities exactly, we generally have to

make approximations. One extreme approximation

is a point estimate. So we replace some complicated distribution with

a single value. That single value will be

chosen in some way that might be maximum

likelihood for example. So if you’re taking a neural net

and you’re training by minimizing error which

is a log likelihood under a [inaudible]

noise distribution, then you’re approximating

that probabilistic inference, maybe with a very drastic approximation.
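
For the reader, the standard identity behind that remark, in conventional notation where $y(x_n,\mathbf{w})$ is the network output, $t_n$ the target, and $\sigma^2$ an assumed fixed noise variance:

```latex
\ln p(\mathbf{t} \mid \mathbf{w})
  = \sum_{n=1}^{N} \ln \mathcal{N}\bigl(t_n \mid y(x_n, \mathbf{w}), \sigma^2\bigr)
  = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \bigl(t_n - y(x_n, \mathbf{w})\bigr)^2 + \text{const}
```

So the weights that maximize this log likelihood are exactly the weights that minimize the sum-of-squares error, which is the sense in which error minimization is a point-estimate (maximum likelihood) approximation to probabilistic inference.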

And the bigger, the

more complex the model, the more data you have, the more performant you need to be, typically, the more radical the approximations you have to make in order to get something

that’s tractable and sufficient to

perform your application. So it is quite general.

Generally speaking though, although you may not be able to maintain full

probability distributions over all of the internal variables like we did in

the movie example, so all the internal weights

of the neural net. Nevertheless, the outputs almost invariably should

be probabilities. So I would say as a rule, whenever you’re

making predictions, they should always be

probabilistic predictions. One of the problems

with Support Vector Machines is that they're just

intrinsically non-probabilistic, and there are ways of fixing

it up afterwards. So when you make

a probabilistic prediction, instead of saying this person

has cancer or they don’t, you say there’s a 37 percent

chance they have cancer. First of all, you can threshold it and it's back to a decision, but you can do so much more. For example, maybe the cost of taking somebody

who has cancer and misdiagnosing them as

not having cancer is much worse than

say, somebody who's healthy and diagnosing them with cancer, because in

the first case they might die, and in the second

case they might get upset and need

some further tests. So those losses are very asymmetric

and if you've got probabilities out, you can take that into

account correctly. You can use

probabilities to combine the outputs from

multiple systems; it's like a Euro of uncertainty, a universal currency, so you can

combine different systems. You can do things with

thresholds: you can say,

a decision when my confidence is

above a certain level, if my confidence is

below that level, I'm going to send it off to a human. So if you've got some very repetitive

task, say medical screening where people are staring down

microscopes all day long, you might be able

to help them by taking the 90 percent of

the data where I'm very confident this is not

cancerous but everything else we’re going to leave

to human judgment. That’s a very practical

thing to do today.
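
To illustrate both points, here is a minimal sketch with entirely hypothetical costs and thresholds: the predicted probability is turned into a decision by minimizing expected loss, and cases where the model is not confident enough are deferred to a human.

```python
# A minimal sketch (hypothetical costs and thresholds) of turning a
# probabilistic prediction into a decision.

# Asymmetric loss: dismissing a real cancer is far worse than
# flagging a healthy person for further tests.
LOSS = {
    ("flag",    "cancer"): 0.0,  ("flag",    "healthy"): 1.0,
    ("dismiss", "cancer"): 50.0, ("dismiss", "healthy"): 0.0,
}

AUTO_HANDLE_CONFIDENCE = 0.9   # assumed: only auto-handle the "easy" cases

def decide(p_cancer: float) -> str:
    # Defer to human judgment unless we are very confident it is not cancerous.
    if (1.0 - p_cancer) < AUTO_HANDLE_CONFIDENCE:
        return "refer to human"
    # Otherwise pick the action with the smallest expected loss.
    expected = {
        a: p_cancer * LOSS[(a, "cancer")] + (1.0 - p_cancer) * LOSS[(a, "healthy")]
        for a in ("flag", "dismiss")
    }
    return min(expected, key=expected.get)

print(decide(0.37))   # a 37 percent chance of cancer -> "refer to human"
print(decide(0.001))  # very confident it is not cancerous -> "dismiss"
```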

So there are lots and lots of advantages to having probabilistic predictions

and no downside. Always, always

output your probabilities. Okay, Mausam has a question

then we’ll take one more. So two more questions

then we'll wrap.>>I'm Mausam from IIT Delhi. Thank you for

the very nice talk. I learnt my AI in

the early 2000s, and that was the time

probabilistic graphical models

were at the peak, and I'm an application

researcher, I work in Natural

Language processing and I remember conferences

where pretty much all the papers except a very few were all probabilistic

graphical models based sometimes at some point

it became LDM based and so on and so forth. Of course, there’s

a new world order, and so I find very few papers in the application area and I’m

not talking about people who work on the theory and the

fundamentals of Machine Learning. There’s a lot of work

still going on in there and some unsupervised

learning as well. But in the application domains, it is all neural networks

left right and centre and Probabilistic

Graphical models are either not being tried or have been overtaken, and life has changed. So I want to understand your perspective on the future: in the time to come, what do you see as the role of Probabilistic

Graphical Model-based solutions in application areas? Do you believe that they will

still have a strong role to play or do you believe that they will be overtaken

by neural networks? If they have a role to play, would it be in

conjunction with neural networks? What is the value it will

offer, and when is using a PGM solution the right

solution to approach?>>Sure. So you’ve got to understand that

Machine Learning like everything else is

a social enterprise. All right. So let’s

take neural networks. There was tremendous

excitement in the 1960s around perceptrons

because machines could learn and you could cut 10 percent of

the wires and it carried on working just not quite

as well just like the brain. So

tremendous excitement. Then it all went away again,

and then it all came back again in the 1980s and 1990s. Neural nets were

the solution to everything. Then it all went away again

and then it all came back again. So right now it's all- And

in the application domain, we’ve got these

particular techniques, certain classes of

convolutional nets and LSTMs, and a handful of things. They're working very well on certain

problems for which we can get lots of data we

can label up by hand. Many, many practical

applications. So it’s unsurprising that this tremendous focus of applications is bearing

down on this one setup. We've discovered

this new technique and everyone applying it in all kinds of

places. That’s unsurprising. If you step back and look at the field of Machine Learning, it’s a very broad field and this discriminative

training based on hand-labeled data

was one tiny corner, which has all kinds of, I think you know the last

speaker covered some of these, there are so many

limitations; we're just scratching the surface of what we want to do

with machine learning. There's the whole world of reinforcement learning,

unsupervised learning, that [inaudible] and

somebody mentioned the work of [inaudible]

and others, and there are all the issues about

bias in learning. Think of the world of

Machine Learning as this enormous opportunity that’s out there in front of us, and then right now there’s

a whole bunch of people, for understandable

and good reasons, focused on one particular

set of techniques and applications. So first of all, probabilities

are the foundation, there’s a mathematical

theorem that says if you’re behaving rationally

and you’re not certain, you’re going to use

probabilities or something equivalent. So

it's not going to go away. I don't think the maths is

going to change. The graphical models are

just a very beautiful notation. Personally, I find

a picture is worth a thousand equations and it’s just much easier to look

at a picture and see what it’s saying than

pages and pages of Maths. So I don’t think

the pictures are going to go away any time soon. But your question is really

about practical applications, and there are

so many applications. We've been working

on applications, and you'll see examples in the book, where just throwing a neural network at the problem is not

the right way to go, where actually

graphical models are the appropriate tool

and technique to use. So I can’t predict what

the next wave is going to be, maybe reinforcement

learning will dig in and get

some real traction, everyone will lurch

across and start applying reinforcement learning

to everything. But in terms of

the field of Machine Learning, what an amazing time to

be going into the field. We’re just at

the beginning of this. My son is at university

doing Computer Science. He’s interested in

Machine Learning. I think well, that’s great.

There’s a whole career to be built in this because we’re just at

the beginning of this.>>But just a tiny follow up. So you said that you tried neural networks in some

applications that didn’t work, I’m really happy to

hear that but can you characterize what kinds

of settings do you expect neural networks

to not do well and where a PGM would be the solution

in an unsupervised scenario?>>Sure. Just an

example would be the skill matching

example in Xbox. So again, it’s

a chapter in the book. So, what are your assumptions? When you've got

some players and they have some skill and you have

some uncertainty in their skills which we described

actually by the simplest possible [inaudible],

the Gaussian distribution. And then they play

against each other, and now you have some model

for how their performance varies because the stronger

player will sometimes lose to the weaker player

because they didn’t play too well in

that particular game. And that’s how we

model all of that. And in fact,

actually if you take that model and look at just

the maximum likelihood limit, where we throw away the uncertainty, you come up with

something called Elo, which is the standard method

used in chess worldwide.
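
For reference, the classic Elo update being alluded to can be sketched in a few lines; the K-factor and the ratings below are assumed values, and the model in the book keeps a full Gaussian belief over each player's skill rather than this single point estimate.

```python
# Classic Elo update (the point-estimate limit referred to above).
# K is an assumed step size; real rating systems tune it per player and game.
K = 32.0

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_won: bool) -> tuple[float, float]:
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = K * (s_a - e_a)          # a big surprise produces a big correction
    return r_a + delta, r_b - delta

# The weaker player wins for once: both ratings move by a large amount.
print(elo_update(1600.0, 1400.0, a_won=False))
```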

So, that's a model which is appropriate to that particular application. So again, it all comes down to the sort of

fundamental point of it all, which is that there isn't such a thing as a universal algorithm. Again, there's a mathematical

theorem that proves that, it’s about building

the right kind of solution that’s tailored

to your problem. So, you’ll see some examples

of that in the book.>>Okay, we’ll take

one last question here. Second row yeah.>>Hi. This is [inaudible] from

Ministry of Technology, Delhi. And, I’m a PhD student there. And it is really

heartening to see you talking more about the Probabilistic

Graphical Models. I work in Probabilistic

Graphical Models, and at times, in these times, it becomes scary whether

I'm working in the right area when the whole world is

talking about Deep Learning, so it gives me

a sense of security. First that a person like you is propagating that.

Thanks for that. So, the question is, basically

so you shift the onus from the algorithms to the Probabilistic Models

and assumptions. But then, when you walk into the Probabilistic

Graphical Models another question arises

like how you choose. I think the same problem

gets shifted to how you choose which

approximate inference algorithm to use. Like if I work in

structured prediction, there are plenty of approximate

inference techniques, from variational inference to MCMC. There is some

understanding on that, but I think the problem

has just shifted to what approximate

inference algorithm you will use for the

Probabilistic Graphical Model. That is the first question

and the second is at a higher level. So you talk about how there is no single algorithm as

such, and you have to adapt, you have to see the problem, understand the assumptions, and then see which algorithms work there. On the contrary, there's the philosophy of, if I understand it correctly, Pedro Domingos, who talks about the Master Algorithm

which will work, I believe, for almost everything. So what are

your thoughts on that?>>Okay yeah. First of all

I just don't want to leave the impression that it's

Graphical Models over here and there’s Neural Nets

over there and you choose one or you

choose the other. Deep Learning is

the ability to train these deep hierarchical

layered structures and you might describe your

problem by a graphical model, but maybe one of

those conditional probabilities is

a Deep Neural Net.
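
A minimal sketch of what that combination can look like, with made-up dimensions and randomly initialized weights: the model is still the factorized graphical model p(z, x) = p(z) p(x | z), but the conditional p(x | z) has its mean computed by a small neural network.

```python
import numpy as np

rng = np.random.default_rng(1)

# Graphical model: z -> x, i.e. p(z, x) = p(z) * p(x | z).
def sample_z(dim: int = 2) -> np.ndarray:
    """p(z): a simple prior over the latent cause (standard Gaussian here)."""
    return rng.standard_normal(dim)

# p(x | z): a Gaussian conditional whose mean is computed by a
# (randomly initialized, purely illustrative) two-layer neural network.
W1, b1 = rng.standard_normal((8, 2)), np.zeros(8)
W2, b2 = rng.standard_normal((4, 8)), np.zeros(4)

def mean_of_x_given_z(z: np.ndarray) -> np.ndarray:
    h = np.tanh(W1 @ z + b1)          # hidden layer
    return W2 @ h + b2                # mean of the conditional distribution

def sample_x_given_z(z: np.ndarray, noise_std: float = 0.1) -> np.ndarray:
    return mean_of_x_given_z(z) + noise_std * rng.standard_normal(4)

# Ancestral sampling from the joint distribution defined by the graph.
z = sample_z()
x = sample_x_given_z(z)
print(z, x)
```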

So these are not alternatives. I think of the

probabilistic framework and the graphical model as, again, more like

a compass to guide your way around the world of

Machine Learning. Deep Learning is a very powerful technique and

it’s cropping up in many, many different places

it will be used a lot. So, I don’t want to characterize

them as alternatives, but I do like

the Graphical Models as sort of a general framework

for describing models. So, sorry the second part

of the question was the...? Yeah, okay, so.>>[inaudible].>>I mean, just for lack

of time I’ve said very little about

Approximate Inference. Again, that model based

machine learning book guides you through some of

the inference methods that we’re using in that context. And again in real world

applications you make approximations and

those approximations, you know you might have a complicated

multi-modal distribution, you might approximate

it by a Gaussian, which is uni-modal, and you're losing

some sort of uncertainty, some ambiguity there, and that may or may not be important.
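
As a toy numerical illustration of that point (the mixture below is hypothetical): a bimodal belief summarized by a single moment-matched Gaussian keeps the overall mean and variance but loses the fact that there were two distinct modes.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical multi-modal belief: a 50/50 mixture of two well-separated modes.
samples = np.concatenate([
    rng.normal(-3.0, 0.5, 5000),
    rng.normal(+3.0, 0.5, 5000),
])

# A uni-modal Gaussian approximation that matches the first two moments.
mu, sigma = samples.mean(), samples.std()
print(f"moment-matched Gaussian: mean={mu:.2f}, std={sigma:.2f}")

# The Gaussian puts much of its mass near 0, where the true belief has almost
# none: the ambiguity between the two modes has been thrown away.
near_zero_true = np.mean(np.abs(samples) < 1.0)
print(f"true probability mass near 0: {near_zero_true:.3f}")
```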

So part of the challenge, when you don't get

the results you need, is diagnosing where

things go wrong. So making bad assumptions, inappropriate assumptions,

is just one of the places. If somebody hands

you rubbish data that isn’t what it claims to be, then you can just

get bad results even if your assumptions

are correct. And the same thing with

the Inference Algorithm, that’s a whole very complex

world in its own right. And in essence

the goal of infer.net is to hide that from you. You can focus as

the domain expert on your prior knowledge, what you know because you're an expert in Medical Imaging or because

you’re an oncologist, or whatever, you don’t have to know anything

about inference. And the ultimate goal of infer.net is that the inference

will be entirely automatic and we’re not there yet, but we’ve made progress.>>Okay. I think we should

wrap up for now because we all need that kind of

[inaudible] time. I’ll request [inaudible] to say

a vote of thanks for Chris, yeah. Thank you very much.

Thank you Chris.
