ANNOUNCER: The following program

is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome to machine

learning, and welcome to our online audience as well. Let me start with an outline of the

course, and then go into the material of today’s lecture. As you see from the outline, the topics

are given colors, and that designates their main content, whether

it’s mathematical or practical. Machine learning is

a very broad subject. It goes from very abstract theory to

extreme practice as in rules of thumb. And the inclusion of a topic in the

course depends on the relevance to machine learning. So some mathematics is useful because it

gives you the conceptual framework, and then some practical aspects are

useful because they give you the way to deal with real learning systems. Now if you look at the topics, these

are not meant to be separate topics for each lecture. They just highlight the main

content of those lectures. But there is a story line that goes

through it, and let me tell you what the story line is like. It starts here with: what is learning? Can we learn? How to do it? How to do it well? And then the take-home lessons. There is a logical dependency that goes

through the course, and there’s one exception to that

logical dependency. One lecture, which is the third one,

doesn’t really belong here. It’s a practical topic, and the reason

I included it early on is because I needed to give you some tools to play

around with, to test the theoretical and conceptual aspects. If I waited until it belonged normally,

which is to the second aspect of the linear models which is down there, the

beginning of the course would be just too theoretical

for people’s taste. And as you see, if you look at the

colors, it is mostly red in the beginning and mostly blue in the end. So it starts building the

concepts and the theory. And then it goes on to the

practical aspects. Now, let me start today’s lecture. And the subject of the lecture

is the learning problem. It’s an introduction to

what learning is. And I will draw your attention

to one aspect of this slide, which is this part. That’s the logo of the course. And believe it or not,

this is not artwork. This is actually a technical

figure that will come up in one of the lectures. I’m not going to tell you which one. So you can wait in anticipation until it

comes up, but this will actually be a scientific figure that

we will talk about. Now when we move to today’s

lecture, I’m going to talk today about the following. Machine learning is a very broad

subject, and I’m going to start with one example that captures the

essence of machine learning. It’s a fun example about movies

that everybody watches. And then after that, I’m going to

abstract from the learning problem, the practical learning problem,

aspects that are common to all learning situations that

you’re going to face. And in abstracting them, we’ll have the

mathematical formalization of the learning problem. And then we will get our first algorithm

for machine learning today. It’s a very simple algorithm, but it

will fix the idea about what is the role of an algorithm in this case. And we will survey the types of learning,

so that we know which part we are emphasizing in this course,

and which parts are nearby. And I will end up with a puzzle, a very

interesting puzzle, and it’s a puzzle in more ways than

one, as you will see. OK, so let me start with an example. The example of machine learning that

I’m going to start with is how a viewer would rate a movie. Now that is an interesting problem, and

it’s interesting for us because we watch movies, and very interesting for

a company that rents out movies. And indeed, a company which is Netflix

wanted to improve the in-house system by a mere 10%. So they make recommendations when you

log in, they recommend movies that they think you will like, so they think

that you’ll rate them highly. And they had a system, and they

wanted to improve the system. So how much is a 10% improvement in

performance worth to the company? It was actually $1 million that was

paid out to the first group that actually managed to get

the 10% improvement. So you ask yourself, 10% improvement

in something like that, why should that be worth a million dollars? It’s because, if the recommendations

that the movie company makes are spot on, you will pay more attention to the

recommendation, you are likely to rent the movies that they recommend, and they

will make lots of money– much more than the million dollars

they promised. And this is very typical

in machine learning. For example, machine learning has

applications in financial forecasting. You can imagine that the minutest

improvement in financial forecasting can make a lot of money. So the fact that you can actually push

the system to be better using machine learning is a very attractive aspect of

the technique in a wide spectrum of applications. So what did these guys do? They gave the data, and people started

working on the problem using different algorithms, until someone managed

to get the prize. Now if you look at the problem of

rating a movie, it captures the essence of machine learning, and the

essence has three components. If you find these three components in

a problem you have in your field, then you know that machine learning is

ready as an application tool. What are the three? The first one is that

a pattern exists. If a pattern didn’t exist, there

would be nothing to look for. So what is the pattern here? There is no question that the way

a person rates a movie is related to how they rated other movies, and is

also related to how other people rated that movie. We know that much. So there is a pattern

to be discovered. However, we cannot really pin

it down mathematically. I cannot ask you to write a 17th-order

polynomial that captures how people rate movies. So the fact that there is a pattern,

and that we cannot pin it down mathematically, is the reason why we

are going for machine learning. For “learning from data”. We couldn’t write down the system on our

own, so we’re going to depend on data in order to be able

to find the system. There is a missing component

which is very important. If you don’t have that,

you are out of luck. We have to have data. We

are learning from data. So if someone knocks on my door with

an interesting machine learning application, and they tell me how

exciting it is, and how great the application would be, and how much

money they would make, the first question I ask, what data do you have? If you data, we are in business. If you don’t, you are out of luck. If you have these three components,

you are ready to apply machine learning. Now let me give you a solution to the

movie rating, in order to start getting a feel for it. So here is a system. Let me start to focus on part of it. We are going to describe a viewer

as a vector of factors, a profile if you will. So if you look here for example, the

first one would be comedy content. Does the movie have a lot of comedy? From a viewer point of view,

do they like comedies? Here, do they like action? Do they prefer blockbusters, or

do they like fringe movies? And you can go on all the way, even to

asking yourself whether you like the lead actor or not. Now you go to the content of the

movie itself, and you get the corresponding part. Does the movie have comedy? Does it have action? Is it a blockbuster? And so on. Now you compare the two, and you realize

that if there is a match– let’s say you hate comedy and the

movie has a lot of comedy– then the chances are you’re

not going to like it. But if there is a match between so many

coordinates, and the number of factors here could be

really like 300 factors. Then the chances are you’ll

like the movie. And if there’s a mismatch, the

chances are you’re not going to like the movie. So what do you do, you match the movie and the viewer

factors, and then you add the contributions of them. And then as a result of that, you

get the predicted rating. This is all good except for one problem,

which is this is really not machine learning. In order to produce this thing, you have

to watch the movie, and analyze the content. You have to interview the viewer,

and ask about their taste. And then after that, you combine

them and try to get a prediction for the rating. Now the idea of machine learning is that

you don’t have to do any of that. All you do is sit down and sip on your

tea, while the machine is doing something to come up with

this figure on its own. So let’s look at the

learning approach. So in the learning approach, we know

that the viewer will be a vector of different factors, and different

components for every factor. So this vector will be different

from one viewer to another. For example, one viewer will have a big

blue content here, and one of them will have a small blue content,

depending on their taste. And then, there is the movie. And a particular movie will have different

contents that correspond to those. And the way we said we are computing the

rating, is by simply taking these and combining them and

getting the rating. Now what machine learning will do is

reverse-engineer that process. It starts from the rating, and then

tries to find out what factors would be consistent with that rating. So think of it this way. You start, let’s say, with

completely random factors. So you take these guys, just random

numbers from beginning to end, and these guys, random numbers

from beginning to end. For every user and every movie,

that’s your starting point. Obviously, there is no chance in the

world that when you get the inner product between these factors that are

random, that you’ll get anything that looks like the rating that actually

took place, right? But what you do is you take a rating

that actually happened, and then you start nudging the factors ever so

slightly toward that rating. Make the direction of the inner product

get closer to the rating. Now it looks like a hopeless thing. I

start with so many factors, they are all random, and I’m trying to

make them match a rating. What are the chances? Well the point is that you are going to

do this not for one rating, but for a 100 million ratings. And you keep cycling through

the 100 million, over and over and over. And eventually, lo and behold, you

find that the factors now are meaningful in terms of the ratings. And if you get a user, a viewer here,

that didn’t watch a movie, and you get the vector that resulted from that

learning process, and you get the movie vector that resulted from that

process, and you do the inner product, lo and behold, you get a rating which

is actually consistent with how that viewer rates the movie. That’s the idea. Now this actually, the solution I

described, is one of the winning solutions in the competition

that I mentioned. So this is for real, this

actually can be used. Now with this example in mind,

let’s actually go to the components of learning. So now I would like to abstract from the

learning problems that I see, what are the mathematical components that

make up the learning problem? And I’m going to use a metaphor. I’m going to use a metaphor now from

another application domain, which is a financial application. So the metaphor we are going to

use is credit approval. You apply for a credit card, and the

bank wants to decide whether it’s a good idea to extend a credit

card for you or not. From the bank’s point of view,

if they’re going to make money, they are happy. If they are going to lose money,

they are not happy. That’s the only criterion they have. Now, very much like we didn’t have

a magic formula for deciding how a viewer will rate a movie, the bank

doesn’t have a magic formula for deciding whether a person

is creditworthy or not. What they’re going to do, they’re going

to rely on historical records of previous customers, and how their credit

behavior turned out, and then try to reverse-engineer the system, and

when they get the system frozen, they’re going to apply it

to a future customer. That’s the deal. What are the components here? Let’s look at it. First, you have the applicant

information. And the applicant information– you

look at this, and you can see that there is the age, the gender, how much

money you make, how much money you owe, and all kinds of fields that are

believed to be related to the creditworthiness. Again, pretty much like we did in

the movie example, there is no question that these fields are related

to the creditworthiness. They don’t necessarily uniquely

determine it, but they are related. And the bank doesn’t want a sure bet.

They want to get the credit decision as reliable as possible. So they want to use that pattern,

in order to be able to come up with a good decision. And they take this input, and they want

to approve the credit or deny it. So let’s formalize this. First, we are going to

have an input. And the input is called

x. Surprise, surprise! And that input happens to be

the customer application. So we can think of it as

a d-dimensional vector, the first component is the salary, years in

residence, outstanding debt, whatever the components are. You put it as a vector, and

that becomes the input. Then we get the output y. The output

y is simply the decision, either to extend credit or not to extend

credit, +1 and -1. And being a good or bad customer, that

is from the bank’s point of view. Now we have after that,

the target function. The target function is a function

from a domain X, which is the set of all of these x’s. So it is the set of vectors

of d dimensions. So it’s a d-dimensional Euclidean

space, in this case. And then the Y is the set of y’s. Well, that’s an easy one because

y can only be +1 or -1, accept or deny. And therefore this is just

a binary co-domain. And this target function is the ideal

credit approval formula, which we don’t know. In all of our endeavors in machine

learning, the target function is unknown to us. If it were known, nobody

needs learning. We just go ahead and implement it. But we need to learn it because

it is unknown to us. So what are we going

to do to learn it? We are going to use data, examples. So the data in this case is based on

previous customer application records. The input, which is the information in

their applications, and the output, which is how they turned

out in hindsight. This is not a question of prediction

at the time they applied, but after five years, they turned out

to be a great customer. So the bank says, if someone has

these attributes again, let’s approve credit because these guys

tend to make us money. And this one made us lose a lot of

money, so let’s deny it, and so on. And the historical records– there are

plenty of historical records. All of this makes sense when you’re

talking about having 100,000 of those guys. Then you pretty much say, I will

capture what the essence of that function is. So this is the data, and then you use

the data, which is the historical records, in order to

get the hypothesis. The hypothesis is the formal name we’re

going to call the formula we get to approximate the target function. So the hypothesis lives in the same

world as the target function, and if you look at the value of g, it supposedly

approximates f. While f is unknown to us,

g is very much known– actually we created it– and the hope

is that it does approximate f well. That’s the goal of learning. So this notation will be our notation

for the rest of the course, so get used to it. The target function is always f, the

hypothesis we produce, which we’ll refer to as the final hypothesis will be

called g, the data will always have that notation– there are capital N

points, which are the data set. And the output is always y. So this is the formula to be used. Now, let’s put it in a diagram in order

to analyze it a little bit more. If you look at the diagram

here, here is the target function and it is unknown– that is the ideal approval which we will

never know, but that’s what we’re hoping to get to approximate. And we don’t see it. We see it only through the eyes

of the training examples. This is our vehicle of understanding

what the target function is. Otherwise the target function is

a mysterious quantity for us. And eventually, we would like to

produce the final hypothesis. The final hypothesis is the formula the

bank is going to use in order to approve or deny credit, with the hope

that g hopefully approximates that f. Now what connects those two guys? This will be the learning algorithm. So the learning algorithm takes the

examples, and will produce the final hypothesis, as we described in the

example of the movie rating. Now there is another component that

goes into the learning algorithm. So what the learning algorithm does, it

creates the formula from a preset model of formulas, a set of candidate

formulas, if you will. And these we are going to call the

hypothesis set, a set of hypotheses from which we are going to

pick one hypothesis. So from this H comes a bunch of small

h’s, which are functions that can be candidates for being the

credit approval. And one of them will be picked by the

learning algorithm, which happens to be g, hopefully approximating f. Now if you look at this part of the

chain, from the target function to the training to the learning algorithm to

the final hypothesis, this is very natural, and nobody will

object to that. But why do we have this

hypothesis set? Why not let the algorithm

pick from anything? Just create the formula, without being

restricted to a particular set of formulas H. There are two reasons, and

I want to explain them. One of them is that there is no downside

for including a hypothesis set in the formalization. And there is an upside. So let me describe why there is no

downside, and then describe why there is an upside. There is no downside for the simple

reason that, from a practical point of view, that’s what you do. You want to learn, you say I’m going

to use a linear formula. I’m going to use a neural network. I’m going to use a support

vector machine. So you are already dictating

a set of hypotheses. If you happen to be a brave soul, and you

don’t want to restrict yourself at all, very well, then your hypothesis

set is the set of all possible hypotheses. Right? So there is no loss of generality

in putting it. So there is no downside. The upside is not obvious here, but it

will become obvious as we go through the theory. The hypothesis set will play a pivotal

role in the theory of learning. It will tell us: can we learn, and

how well we learn, and whatnot. Therefore having it as an explicit

component in the problem statement will make the theory go through. So that’s why we have this figure. Now, let me focus on the solution

components of that figure. What do I mean by the

solution components? If you look at this, the first part,

which is the target– let me try to expand it– so the target function is

not under your control. Someone knocks on my door and says:

I want to approve credit. That’s the target function, I

have no control over that. And by the way, here are

the historical records. I have no control over that,

so they give me the data. And would you please hand me

the final hypothesis? That is what I’m going to give them at

the end, before I receive my check. So all of that is completely dictated. Now let’s look at the other part. The

learning algorithm, and the hypothesis set that we talked about,

are your solution tools. These are things you choose, in

order to solve the problem. And I would like to take a little bit

of a look into what they look like, and give you an example of them, so that

you have a complete chain for the entire figure in your mind. From the target function, to the data

set, to the learning algorithm, hypothesis set, and the

final hypothesis. So, here is the hypothesis set. We chose the notation H for the

set, and the element will be given the symbol small h. So h is a function, pretty much

like the final hypothesis g. g is just one of them

that you happen to elect. So when we elect it, we call it g. If

it’s sitting there generically, we call it h. And then, when you put them together,

they are referred to as the learning model. So if you’re asked what is the learning

model you are using, you’re actually choosing both a hypothesis

set and a learning algorithm. We’ll see the perceptron in a moment,

so this would be the perceptron model, and this would be the

perceptron learning algorithm. This could be neural network, and

this would be back propagation. This could be support vector

machines of some kind, let’s say radial basis function version, and this

would be the quadratic programming. So every time you have a model, there is

a hypothesis set, and then there is an algorithm that will do the

searching and produce one of those guys. So this is the standard form for the solution. Now, let me go through a simple

hypothesis set in detail so we have something to implement. So after the lecture, you can actually

implement a learning algorithm on real data if you want to. This is not a glorious model. It’s

a very simple model. On the other hand, it’s a very clear model to pinpoint

what we are talking about. So here is the deal. You have an input, and the input

is x_1 up to x_d, as we said– d-dimensional vector– and each of them

comes from the real numbers, just for simplicity. So this belongs to the real numbers. And these are the attributes

of a customer. As we said, salary, years in

residence, and whatnot. So what does the perceptron model do? It does a very simple formula. It takes the attributes you have and

gives them different weights, w. So let’s say the salary is important,

the chances are w corresponding to the salary will be big. Some other attribute is

not that important. The chances are the w that

goes with it is not that big. Actually, outstanding

debt is bad news. If you owe a lot, that’s not good. So the chances are the weight will

be negative for outstanding debt, and so on. Now you add them together, and you add

them in a linear form– that’s what makes it a perceptron– and you can look at this as

a credit score, of sorts. Now you compare the credit

score with a threshold. If you exceed the threshold, they

approve the credit card. And if you don’t, they

deny the credit card. So that is the formula they settle on. They have no idea, yet, what the w’s and

the threshold are, but they dictated the formula– the analytic form that

they’re going to use. Now we take this and we put it

in the formalization we had. We have to define a hypothesis h,

and this will tell us what is the hypothesis set that has all the

hypotheses that have the same functional form. So you can write it down as this. This is a little bit long, but there’s

absolutely nothing to it. This is your credit score, and this

is the threshold you compare to by subtracting. If this quantity is positive, you belong

to the first thing and you will approve credit. If it’s negative, you belong here

and you will deny credit. Well, the function that takes a real

number, and produces the sign +1 or -1, is called the sign. So when you take the sign of this thing,

this will indeed be +1 or -1, and this will give

the decision you want. And that will be the form

of your hypothesis. Now let’s put it in color, and you

realize that what defines h is your choice of w_i and the threshold. These are the parameters that define

one hypothesis versus the other. x is an input that will be

put into any hypothesis. As far as we are concerned, when we are

learning for example, the inputs and outputs are already determined. These are the data set. But what we vary to get one hypothesis

or another, and what the algorithm needs to vary in order to choose the

final hypothesis, are those parameters which, in this case, are

w_i and the threshold. So let’s look at it visually. Let’s assume that the data

you are working with is linearly separable. Linearly separable in this case, for

example, you have nine data points. And if you look at the nine data

points, some of them were good customers and some of them

were bad customers. And you would like now to apply the

perceptron model, in order to separate them correctly. You would like to get to this situation,

where the perceptron, which is this purple line, separates the blue

region from the red region or the pink region, and indeed all the good

customers belong to one, and the bad customers belong to the other. So you have hope that a future customer,

if they lie here or lie here, they will be classified

correctly. If there is actually a simple linear

pattern to this to be detected. But when you start, you start with

random weights, and the random weights will give you any line. So the purple line in both

cases corresponds to the purple parameters there. One choice of these w’s and the

threshold corresponds to one line. You change them, you get another line. So you can see that the learning

algorithm is playing around with these parameters, and therefore moving the

line around, trying to arrive at this happy solution. Now we are going to have a simple

change of notation. Instead of calling it threshold, we’re

going to treat it as if it’s a weight. It was minus threshold.

Now we call it, plus w_0. Absolutely nothing, all you need

to do is choose w_0 to be minus the threshold. No big deal. So why do we do that? We do that because we are going to

introduce an artificial coordinate. Remember that the input

was x_1 through x_d. Now we’re going to add x_0. This is not an attribute of

the customer, but an artificial constant we add, which

happens to be always +1. Why are we doing this?

You probably guessed. Because when you do that, then all of

a sudden the formula simplifies. Now you are summing from i equals

0, instead of i equals 1. So you added the zero term,

and what is the zero term? It’s the threshold which you

conveniently call w_0 with a plus sign, multiplied by the 1. So indeed, this will be the formula

equivalent to that. So it looks better. And this is the standard notation

we’re going to use. And now we put it as a vector

form, which will simplify matters, so in this case you will be having an inner

product between a vector w, a column vector, and a vector x. So the vector w would be w_0,

w_1, w_2, w_3, w_4, et cetera. And x_0, x_1, x_2, et cetera. And you do the inner product by taking

a transpose, and you get a formula which is exactly the formula

you have here. So now we are down to this formula

for the perceptron hypothesis. Now that we have the hypothesis set,

let’s look for the learning algorithm that goes with it. The hypothesis set tells you the

resources you can work with. Now we need the algorithm that is

going to look at the data, the training data that you’re going to use,

and navigate through the space of hypotheses, to bring the one that

is going to output as the final hypothesis that you give

to your customer. So this one is called the perceptron

learning algorithm, and it implements this function. What it does is the following. It takes the training data. That is always what a learning

algorithm does. This is their starting point. So it takes existing customers, and

their existing credit behavior in hindsight– that’s what it uses– and what does it do? It tries to make the w correct. So it really doesn’t like at all

when a point is misclassified. So if a point is misclassified,

it means that your w didn’t do the right job here. So what does it mean to be

a misclassified point here? It means that when you apply your

formula, with the current w– the w is the one that the algorithm

will play with– apply it to this particular x. Then what happens? You get something that is not the

credit behavior you want. It is misclassified. So what do we do when a point

is misclassified? We have to do something. So what the algorithm does, it

updates the weight vector. It changes the weight, which changes

the hypothesis, so that it behaves better on that particular point. And this is the formula that it does. So I’ll explain it in a moment. Let me first try to explain the inner

product in terms of agreement or disagreement. If you have the vector x and the vector

w this way, their inner product will be positive, and the sign

will give you a +1. If they are this way, the inner product

will be negative, and the sign will be -1. So being misclassified means that

either they are this way and the output should be -1, or it’s this

way and output should be +1. That’s what makes it misclassified,

right? So if you look here at this formula, it

takes the old w and adds something that depends on the misclassified

point. Both in terms of the x_n and y_n. y_n is just +1 or -1. So here you are either adding a vector

or subtracting a vector. And we will see from this diagram that

you’re always doing so in such a way that you make the point more likely

to be correctly classified. How is that? If y equals +1, as you see here,

then it must be that since the point is misclassified, that

w dot x was negative. Now when you modify this to w plus

y x, it’s actually w plus x. You add x to w, and when you add x to

w you get the blue vector instead of the red vector. And lo and behold, now the inner

product is indeed positive. And in the other case when it’s -1,

it is misclassified because they were this way. They give you +1 when

it should be -1. And when you apply the rule, since

y is -1, you are actually subtracting x. So you subtract x and get this guy,

and you will get the correct classification. So this is the intuition behind it. However, it is not the intuition

that makes this work. There are a number of problems

with this approach. I just motivated that

this is not a crazy rule. Whether or not it’s a working

rule, that is yet to be seen. Let’s look at the iterations of

the perceptron learning algorithm. Here is one iteration of PLA. So you look at this thing, and you have

this current w corresponds to the purple line. This guy is blue in the red region. It means it’s misclassified. So now you would like to adjust

the weights, that is move around that purple line, such that the

point is classified correctly. If you apply the learning rule, you’ll

find that you’re actually moving in this direction, which means that the

blue point will likely be correctly classified after that iteration. There is a problem because, let’s

say that I actually move this guy in this direction. Well this one, I got it right, but this

one, which used to be right, now is messed up. It moved to the blue region, right? And if you think about it, I’m trying

to take care of one point, and I may be messing up all other points, because

I’m not taking them into consideration. Well, the good news for the perceptron learning algorithm is that all you need

to do, is for iterations 1, 2, 3, 4, et cetera, pick a misclassified

point, anyone you like. And then apply the iteration to it. The iteration we just talked about,

which is this one. The top one. And that’s it. If you do that, and the data was

originally linearly separable, then you will end up with the case that you

will get to a correct solution. You will get to something that

classifies all of them correctly. This is not an obvious statement. It requires a proof. The proof is not that hard. But it gives us the simplest possible

learning model we can think of. It’s a linear model, and

this is your algorithm. All you need to do is be very patient,

because 1, 2, 3, 4– this is a really long. At times it can be very long. But it eventually converges. That’s the promise, as long as the data is

linearly separable. So now we have one learning model, and

if I give you now data from a bank– previous customers and their credit

behavior– you can actually run the perceptron learning algorithm, and come up

with a final hypothesis g that you can hand to the bank. Not clear at all that it will be good,

because all you did was match the historical records. Well, you may ask the question: if I

match the historical records, does this mean that I’m getting future customers

right, which is the only thing that matters? The bank already knows what happened

with the previous customers. It’s just using the data to help you

find a good formula. The formula will be good or not good to

the extent that it applies to a new customer, and can predict the

behavior correctly. Well, that’s a loaded question

which will be handled in extreme detail, when we talk about

the theory of learning. That’s why we have to develop

all of this theory. So, that’s it. And that is the perceptron

learning algorithm. Now let me go into the bigger picture

of learning, because what I talked about so far is one type of learning. It happens to be by far the most

popular, and the most used. But there are other types of learning. So let’s talk about the premise of

learning, from which the different types came about. That’s what learning is about. This is the premise that is common

between any problem that you would consider learning. You use a set of observations,

what we call data, to uncover an underlying process. In our case, the target function. You can see that this is

a very broad premise. And therefore, you can see that people

have rediscovered that over and over and over, in so many disciplines. Can you think of a discipline, other than

machine learning, that uses that as its exclusive premise? Anybody have taken courses

in statistics? In statistics, that’s what they do. The underlying process is

a probability distribution. And the observations are samples

generated by that distribution. And you want to take the samples, and

predict what the probability distribution is. And over and over, there are so many

disciplines under different names. Now when we talk about different types

of learning, it’s not like we sit down and look at the world and say, this

looks different from this because the assumptions look different. What you do is, you take this premise

and apply it in a context. And that calls for a certain amount

of mathematics and algorithms. If a particular set of assumptions takes

you sufficiently far from the mathematics and the algorithms you used

in the other disciplines, that it takes on a life of its own. And it develops its own math and

algorithms, then you declare it a different type. So when I list the types, it’s not

completely obvious just by the slide itself, that these should be

the types that you have. But for what it’s worth, these

are the most important types. First one is supervised learning,

that’s what we have been talking about. And I will discuss it in detail, and tell

you why it’s called supervised. And it is, by far, the concentration

of this course. There is another one which is called

unsupervised learning, and unsupervised learning

is very intriguing. I will mention it briefly here, and then

we will talk about a very famous algorithm for unsupervised learning

later in the course. And the final type is reinforcement

learning, which is even more intriguing, and I will

discuss it in a brief introduction in a moment. So let’s take them one by one. Supervised learning. So what is supervised learning? Anytime you have the data that is

given to you, with the output explicitly given– here is the user

and movie, and here is the rating. Here is the previous customer, and

here is their credit behavior. It’s as if a supervisor is helping you

out, in order to be able to classify the future ones. That’s why it’s called supervised. Let’s take an example of coin

recognition, just to be able to contrast it with unsupervised

learning in a moment. Let’s say you have a vending machine,

and you would like to make the system able to

recognize the coins. So what do you do? You have physical measurements of the

coin, let’s be simplistic and say we measure the size and mass

of the coin you put. Now the coins will be quarters,

nickels, pennies, and dimes. 25, 5, 1, and 10. And when you put the data in this

diagram, they will belong there. So the quarters, for example, are

bigger, so they will belong here. And the dimes in the US currency happen

to be the smallest of them, so they are smallest here, and there

will be a scatter because of the error in measurement, because of the exposure

to the elements, and whatnot. So let’s say that this is your

training data, and it’s supervised because things are colored. I gave you those and told you they

are 25 cents, 5 cents, et cetera. So you use those in order to train

a system, and the system will then be able to classify a future one. For example, if we stick to the

linear approach, you may be able to find separator lines like those. And those separator lines will

separate, based on the data, the 10 from the 1 from the

5 from the 25. And once you have those, you can bid farewell to the data.

You don’t need it anymore. And when you get a future coin that is

now unlabeled, you don’t know what it is, when the vending machine is actually

working, then the coin will lie in one region or another, and you’re

going to classify it accordingly. So that is supervised learning. Now let’s look at unsupervised

learning. For unsupervised learning, instead of

having the examples, the training data, having this form which is the

input plus the correct target– the correct output– the customer and how they behaved

in reality in credit, we are going to have examples that have

less information, so much less it is laughable. I’m just going to tell you

what the input is. And I’m not going to tell you what

the target function is at all. I’m not going to tell you anything

about the target function. I’m just going to tell you, here

is the data of a customer. Good luck, try to predict the credit. OK– How in the world are we

going to do that? Let me show you that the situation

is not totally hopeless. That’s what I’m going to achieve. I’m not going to tell you

how to do it completely. But let me show you that a situation

like that is not totally hopeless. Let’s go for the coin example. For the coin example, we have

data that looks like this. If I didn’t tell you what the

denominations are, the data would look like this. Right? You have the measurements, but you don’t

know, is that a quarter, is it– you don’t know. Now honestly, if you look at this

thing, you say I can know something from this figure. Things tend to cluster together. So I may be able to classify those

clusters into categories, without knowing what the categories are. That will be quite

an achievement already. You still don’t know whether it’s

25 cents, or whatever. But the data actually made you

able to do something that is a significant step. You’re going to be able to come

up with these boundaries. And now, you are so close to

finding the full system. So unlabeled data actually

can be pretty useful. Obviously, I have seen the colored

ones, so I actually chose the boundaries right because I still

remember them visually. But if you look at the clusters and

you have never heard about that, especially these guys might not

look like two clusters. They may look like one cluster. So it actually could be that this is

ambiguous, and indeed in unsupervised learning, the number of clusters

is ambiguous at times. And then, what you do– this is the output of your system.

Now, I can categorize the coins into types. I’m just going to call them

types: type 1, type 2, type 3, type 4. I have no idea which belongs to which,

but obviously if someone comes with a single example of a quarter, a dime,

et cetera, then you are ready to go. Whereas before, you had to have lots of

examples in order to choose where exactly to put the boundary. And this is why a set like that,

which looks like complete jungle, is actually useful. Let me give you another interesting

example of unsupervised learning, where I give you the input without the

output, and you are actually in a better situation to learn. Let’s say that your company or your

school in this case, is sending you for a semester in Rio de Janeiro. So you’re very excited, and you

decide that you’d better learn some Portuguese, in order to be able to

speak the language when you arrive. Not to worry, when you arrive, there

will be a tutor who teaches you Portuguese. But you have a month to go,

and you want to help yourself as much as possible. You look around, and you find that the

only resource you have is a radio station in Portuguese in your car. So what you do, you just turn

it on whenever you drive. And for an entire month, you’re

bombarded with Portuguese. “tudo bem”, “como vai”, “valeu”,

stuff like that comes back. After a while, without knowing anything–

it’s unsupervised, nobody told you the meaning of any word– you start to develop a model of

the language in your mind. You know what the idioms

are, et cetera. You are very eager to know

what actually “tudo bem” — what does that mean? You are ready to learn, and once

you learn it, it’s actually fixed in your mind. Then when you go there, you will learn

the language faster than if you didn’t go through this experience. So you can think of unsupervised

learning, in one way or another, as a way of getting a higher-level

representation of the input. Whether it’s extremely high level as

in clusters– you forgot all the attributes and you just tell me a label,

or higher level as in this– a better representation than just the

crude input into some model in your mind. Now let’s talk about

reinforcement learning. In this case, it’s not as bad

as unsupervised learning. So again, without the benefit of

supervised learning, you don’t get the correct output. What you do is– I will

give you the input. OK, thank you very much,

that’s very kind. What else? I’m going to give you some output. The correct output? No! Some output. OK, that’s very nice, but doesn’t

seem very helpful. It looks now like unsupervised learning,

because in unsupervised learning I could give you some output. Here is a dime. Oh, it’s a quarter. It’s some output! Such output has no information. The information comes from the next one. I’m going to grade this output. So that is the information

provided to you. So I’m not explicitly giving you the

output, but when you choose an output, I’m going to tell you how

well you’re doing. Reinforcement learning is interesting

because it is mostly our own experience in learning. Think of a toddler, and a hot

cup of tea in front of her. She is looking at it, and

she is very curious. So she reaches to touch. Ouch! And she starts crying. The reward is very negative

for trying. Now next time she looks at it, and she

remembers the previous experience, and she doesn’t touch it. But there is a certain level of pain,

because there is an unfulfilled curiosity. And curiosity killed the cat. In

three or four trials, the toddler tries again. Maybe now it’s OK. And Ouch! Eventually from just the grade of the

behavior of to touch it or not to touch it, the toddler will learn not to

touch cups of tea that have smoke coming out of them. So that is a case of

reinforcement learning. The most important application, or one

of the most important applications of reinforcement learning, is

in playing games. So backgammon is one of the games,

and think that you want a system to learn it. So what you want, you want to take the

current state of the board, and you roll the dice, and then you decide

what is the optimal move in order to stand the best chance to win. That’s the game. So the target function is the

best move given a state. Now, if I have to generate those things

in order for the system to learn, then I must be a pretty good

backgammon player already. So now it’s a vicious cycle. Now, reinforcement learning

comes in handy. What you’re going to do, you

are going to have the computer choose any output. A crazy move, for all you care. And then see what happens eventually. So this computer is playing against

another computer, both of them want to learn. And you make a move, and eventually

you win or lose. So you propagate back the credit

because of winning or losing, according to a very specific and

sophisticated formula, into all the moves that happened. Now you think that’s completely hopeless,

because maybe this is not the move that resulted in this,

it’s another move. But always remember, that you are going

to do this 100 billion times. Not you, the poor computer. You’re sitting down sipping

your tea. A computer is doing this, playing

against an imaginary opponent, and they keep playing and

playing and playing. And in three hours of CPU time, you go

back to the computer– maybe not three hours, maybe three days of CPU time–

you go back to the computer, and you have a backgammon champion. Actually, that’s true. The world champion, at some point, was

a neural network that learned the way I described. So it is actually a very attractive

approach, because in machine learning now, we have a target function

that we cannot model. That covers a lot of territory,

I’ve seen a lot of those. We have data coming from

the target function. I usually have that. And now we have the lazy

man’s approach to life. We are going to sit down, and let the

computer do all of the work, and produce the system we want. Instead of studying the thing

mathematically, and writing code, and debugging– I hate debugging. And then you go. No,

we’re not going to do that. The learning algorithm just works,

and produces something good. And we get the check. So this is a pretty good deal. It actually is so good, it might

be too good to be true. So let’s actually examine if

all of this was a fantasy. So now I’m going to give you

a learning puzzle. Humans are very good learners, right? So I’m now going to give you a learning

problem in the form that I described, a supervised

learning problem. And that supervised learning problem

will give you a training set, some points mapped to +1, some

points mapped to -1. And then I’m going to give you

a test point that is unlabeled. Your task is to look at the examples,

learn the target function, apply it to the test point, and then decide what

the value of the function is. After that, I’m going to ask, who

decided that the function is +1, and who decided that the

function is -1. OK? It’s clear what the deal is. And I would like our online audience

to do the same thing. And please text what the solution is. Just +1 or -1. Fair enough? Let’s start the game. What is above the line are

the training examples. I put the input as a three-by-three

pattern in order to be visually easy to understand. But this is just really nine

bits worth of information. And they are ones and zeros,

black and white. And for this input, this input, and this

input, the value of the target function is -1. For this input, this input, and this

input, the value of the target function is +1. Now this is your data set, this

is your training set. Now you should learn the function. And when you’re done, could you please

tell me what your function will return on this test point? Is it +1 or -1. I will give everybody 30 seconds

before I ask for an answer. Maybe we should have some

background music? OK, time’s up. Your learning algorithm

has converged, I hope. And now we apply it here, and I ask

people here, who says it’s +1? Thank you. Who says it’s -1? Thank you. I see that the online audience

also contributed? MODERATOR: Yeah, the big

majority says +1. PROFESSOR: But

are there -1’s? MODERATOR: Two -1’s. PROFESSOR: Cool. I don’t care if it’s

a +1 or -1. What I care about is that

I get both answers. That is the essence of it. Why do I care? Because in reality, this

is an impossible task. I told you the target

function is unknown. It could be anything,

really anything. And now I give you the value of the

target function at 6 points. Well, there are many functions that

fit those 6 points, and behave differently outside. For example, if you take the function

to be +1 if the top left square is white, then this should

be -1, right? If you take the function to be +1

if the pattern is symmetric– let’s see, I said it

the other way around. So the top one is black,

it would be -1. So this would be -1. If it’s symmetric, it would be +1. So this would be +1, because

this guy has both– this is black, and also it is symmetric. Right? And you can find infinite

variety like that. And that problem is not restricted

to this case. The question here is obvious. The function is unknown. You really mean unknown, right? Yes, I mean it. Unknown– anything? Yes, I do. OK. You give me a finite sample,

it can be anything outside. How in the world am I going to tell

what the learning outside is? OK, that sounds about right. But we are in trouble, because that’s

the premise of learning. If the goal was to memorize the examples

I gave you, that would be memorizing, not learning. Learning is to figure out a pattern

that applies outside. And now we realize that outside,

I cannot say anything. Does this mean that learning

is doomed? Well, this is going to be

a very short course! Well, the good news is that learning

is alive and well. And we are going to show that, without

compromising our basic premise. The target function will

continue to be unknown. And we still mean unknown. And we will be able to learn. And that will be the subject

of the next lecture. Right now, we are going to go for

a short break, after which we are going to take the Q&A. We’ll start the Q&A, and we will get

questions from the class here, and from the online audience. And if you’d like to ask a question, let

me ask you to go to this side of the room where the mic is, so that

your question can be heard. And we will alternate, if there are

questions here, we will alternate between campus and off campus. So let me start if there is

a question from outside. MODERATOR: Yes, so the most common

question is, how do you determine if a set of points is linearly

separable, and what do you do if they’re not separable. PROFESSOR: The linear separability

assumption is a very simplistic assumption, and doesn’t

apply mostly in practice. And I chose it only because it goes with

a very simple algorithm, which is the perceptron learning algorithm. There are two ways to deal with the

case of linear inseparability. There are algorithms, and most

algorithms actually deal with that case, and there’s also a technique that

we are going to study next week, which will take a set of points

which is not linearly separable, and create a mapping that makes

them linearly separable. So there is a way to deal with it. However, the question how do you

determine it’s linearly separable, the right way of doing it in practice is

that, when someone gives you data, you assume in general it’s not

linearly separable. It will hardly ever be, and therefore

you take techniques that can deal with that case as well. There is a simple modification of the

perceptron learning algorithm, which is called the pocket algorithm, that applies the same rule with a very

minor modification, and deals with the case where the data is not separable. However, if you apply the perceptron

learning algorithm, that is guaranteed to converge to a correct solution in the

case of linear separability, and you apply it to data that is not

linearly separable, bad things happen. Not only is it going not to converge,

obviously it is not going to converge because it terminates when there are

no misclassified points, right? If there is a misclassified point, then

there’s a next iteration always. So since the data is not linearly

separable, we will never come to a point where all the points

are classified correctly. So this is not what is bothering us. What is bothering us is that, as you go

from one step to another, you can go from a very good solution

to a terrible solution. In the case of no linear separability. So it’s not an algorithm that you

would like to use, and just terminate by force at an iteration. A modification of it can be used this

way, and I’ll mention it briefly when we talk about linear regression

and other linear methods. MODERATOR: There’s also a question of

how does the rate of convergence of the perceptron change with the

dimensionality of the data? PROFESSOR: Badly! That’s the answer. Let me put it this way. You can build pathological cases, where

it really will take forever. However, I did not give the perceptron

learning algorithm in the first lecture to tell you that this is

the great algorithm that you need to learn. I gave it in the first lecture,

because this is simplest algorithm I could give. By the end of this course,

you’ll be saying, what? Perceptron? Never heard of it. So it will go out of contention, after we

get to the more interesting stuff. But as a method that can be used, it

indeed can be used, and can be explained in five minutes

as you have seen. MODERATOR: Regarding the items for

learning, you mentioned that there must be a pattern. So can you be more specific about that? How do you know if there’s a pattern? PROFESSOR: You don’t. My answers seem to be very abrupt,

but that’s the way it is. When we get to the theory–

is learning feasible– it will become very clear that there is

a separation between the target function– there is

a pattern to detect– and whether we can learn it. It is very difficult for me to explain

it in two minutes, it will take a full lecture to get there. But the essence of it is that you take

the data, you apply your learning algorithm, and there is something you

can explicitly detect that will tell you whether you learned or not. So in some cases, you’re not

going to be able to learn. In some cases, you’ll be able to learn. And the key is that you’re going

to be able to tell by running your algorithm. And I’m going to explain that

in more details later on. So basically, I’m also resisting

taking the data, deciding whether it’s linearly separable, looking

at it and seeing. You will realize as we go through that it’s

a no-no to actually look at the data. What? That’s what data is for, to look at. Bear with me. We will come to the level where we ask

why don’t we look at the data– just looking at it and then saying:

It’s linearly separable. Let’s pick the perceptron. That’s bad practice, for reasons

that are not obvious now. They will become obvious, once we

are done with the theory. So when someone knocks on my door with

a set of data, I can ask them all kinds of questions about the data– not

the particular data set that they gave me, but about the general data that

is generated by their process. They can tell me this variable is

important, the function is symmetric, they can give you all kinds of

information that I will take to heart. But I will try, as much as I can, to

avoid looking at the particular data set that they gave me, lest I should

tailor my system toward this data set, and be disappointed when another

data set comes about. You don’t want to get too

close to the data set. This will become very clear

as we go with the theory. MODERATOR: In general about

machine learning, how does it relate to other statistical, especially

econometric techniques? PROFESSOR: Statistics is, in

the form I said, it’s machine learning where the target– it’s not a function in this case–

is a probability distribution. Statistics is a mathematical field. And therefore, you put the assumptions

that you need in order to be able to rigorously prove the results you have,

and get the results in detail. For example, linear regression. When we talk about linear regression, it

will have very few assumptions, and the results will apply to a wide range,

because we didn’t make too many assumptions. When you study linear regression under

statistics, there is a lot of mathematics that goes with it, lot of

assumptions, because that is the purpose of the field. In general, machine learning tries to make

the least assumptions and cover the most territory. These go together. So it is not a mathematical discipline,

but it’s not a purely applied discipline. It spans both the mathematical, to

certain extent, but it is willing to actually go into territory where we

don’t have mathematical models, and still want to apply our techniques. So that is what characterizes

it the most. And then there are other fields.

By doing machine learning, you can find it under the name

computational learning, or statistical learning. Data mining has a huge intersection

with machine learning. There are lots of disciplines around

that actually share some value. But the point is, the premise that you

saw is so broad, that it shouldn’t be surprising that people at different times

developed a particular discipline with its own jargon, to deal

with that discipline. So what I’m giving you is machine

learning as the mainstream goes, and that can be applied as widely as

possible to applications, both practical applications and

scientific applications. You will see, here is a situation, I

have an experiment, here is a target, I have the data. How do I produce the target

in the best way I want? And then you apply machine learning. MODERATOR: Also, in a general

question about machine learning. Do machine learning algorithms perform

global optimization methods, or just local optimization methods? PROFESSOR: Obviously,

a general question. Optimization is a tool

for machine learning. So we will pick whatever optimization

that does the job for us. And sometimes, there is a very

specific optimization method. For example, in support vector

machines, it will be quadratic programming. It happens to be the one

that works with that. But optimization is not something

that machine learning people study for its own sake. They obviously study it to understand

it better, and to choose the correct optimization method. Now, the question is alluding

to something that will become clear when we talk about neural

networks, which is local minimum versus global minimum. And it is impossible to put this in

any perspective before we get the details of neural networks,

so I will defer that until we get to that lecture. MODERATOR: Also, this is

a math question, I guess. Is the hypothesis set, in a topological

sense, continuous? PROFESSOR: The hypothesis

set can be anything, in principle. So it can be continuous,

and it can be discrete. For example, in the next lecture I take

the simplest case where we have a finite hypothesis set, in order

to make a certain point. In reality, almost all the hypothesis

sets that you find are continuous and infinite. Very infinite! And the level of sophistication

of the hypothesis set can be huge. And nonetheless, we will be able to see

that under one condition, which comes from the theory, we’ll be able to

learn even if the hypothesis set is huge and complicated. There’s a question from inside, yes? STUDENT: I think I understood, more or

less, the general idea, but I don’t understand the second example

you gave about credit approval. So how do we collect our data? Should we give credit to everyone, or

should we make our data biased, because we cannot determine

the data of– we can’t determine, should we give credit

or not to persons we rejected? PROFESSOR: Correct. This is a good point. Every time

someone asks a question, the lecture number comes to my mind. I know when I’m going

to talk about it. So what you describe is

called sampling bias. And I will describe it in detail. But when you use the biased data, let’s

say the bank uses historical records. So it sees the people who applied and

were accepted, and for those guys, it can actually predict what the credit

behavior is, because it has their credit history. They charged and repaid and maxed

out, and all of this. And then they decide: is this

a good customer or not? For those who were rejected, there’s

really no way to tell in this case whether they were falsely rejected,

that they would have been good customers or not. Nonetheless, if you take the customer

base that you have, and base your decision on it, the boundary

works fairly decently. Actually, pretty decently, even for the

other guys, because the other guys usually are deeper into the

classification region than the boundary guys that you accepted,

and turned out to be bad. But the point is well taken. The data set in this case is not

completely representative, and there is a particular principle in learning

that we’ll talk about, which is sampling bias, that deals

with this case. Another question from here? STUDENT: You explain that we need

to have a lot of data to learn. So how do you decide how much amount

of data that is required for a particular problem, in order to be

able to come up with a reasonable– PROFESSOR: Good question. So let me tell you the theoretical,

and the practical answer. The theoretical answer is that this is

exactly the crux of the theory part that we’re going to talk about. And in the theory, we are going

to see, can we learn? And how much data. So all of this will be answered

in a mathematical way. So this is the theoretical answer. The practical answer is: that’s

not under your control. When someone knocks on your door: Here

is the data, I have 500 points. I tell him, I will give you

a fantastic system if you just give me 2000. But I don’t have 2000, I have 500. So now you go and you use your theory

to do something to your system, such that it can work with the 500. There was one case– I worked with data in different

applications– at some point, we had almost

100 million points. You were swimming in data. You wouldn’t complain about data. Data was wonderful. And in another case, there were

less than 100 points. And you had to deal with

the data with gloves! Because if you use them the wrong way,

they are contaminated, which is an expression we will see, and

then you have nothing. And you will produce a system, and you

are proud of it, but you have no idea whether it will perform well or not. And you cannot give this to the customer,

and have the customer come back to you and say: what did you do!? So there is a question of, what

performance can you do given what data size you have? But in practice, you really have no

control over the data size in almost all the cases, almost all

the practical cases. Yes? STUDENT: Another question I have

is regarding the hypothesis set. So the larger the hypothesis set

is, probably I’ll be able to better fit the data. But that, as you were explaining, might

be a bad thing to do because when the new data point comes,

there might be troubles. So how do you decide

the size of your– PROFESSOR: You are asking all

the right questions, and all of them are coming up. This is again part of the theory,

but let me try to explain this. As we mentioned, learning is about

being able to predict. So you are using the data, not to

memorize it, but to figure out what the pattern is. And if you figure out a pattern that

applies to all the data, and it’s a reasonable pattern, then you

have a chance that it will generalize outside. Now the problem is that, if I give you

50 points, and you use a 7000th-order polynomial, you will fit the

heck out of the data. You will fit it so much with so many

degrees of freedom to spare, but you haven’t learned anything. You just memorized it in a fancy way. You put it in a polynomial form, and

that actually carries all the information about the

data that you have, and then some. So you don’t expect at all that

this will generalize outside. And that intuitive observation

will be formalized when we talk about the theory. There will be a measurement of the

hypothesis set that you give me, that measures the sophistication of it,

and will tell you with that sophistication, you need that amount

of data in order to be able to make any statement about generalization. So that is what the theory is about. STUDENT: Suppose, I mean, here

whatever we discussed, it is like I had a data set and I came up with

an algorithm, and gave the output. But won’t it be also important to see,

OK, we came up with the output, and using that, what was the feedback? Are there techniques where you take

the feedback and try to correct your– PROFESSOR: You are alluding

to different techniques here. But one of them would be validation,

which is after you learn, you validate your solution. And this is an extremely established and

core technique in machine learning that will be covered in

one of the lectures. Any questions from the online audience? MODERATOR: In practice, how many

dimensions would be considered easy, medium, and hard for

a perceptron problem? PROFESSOR: The hard, in most people’s mind before they

get into machine learning, is the computational time. If something takes a lot of time,

then it’s a hard problem. If something can be computed quickly,

it’s an easy problem. For machine learning, the bottleneck

in my case, has never been the computation time, even in

incredibly big data sets. The bottleneck for machine learning is

to be able to generalize outside the data that you have seen. So to answer your question, the

perceptron behaves badly in terms of the computational behavior. We will be able to predict its

generalization behavior, based on the number of dimensions and

the amount of data. This will be given explicitly. And therefore, the perceptron algorithm

is bad computationally, good in terms of generalization. If you actually can get away with

perceptrons, your chances of generalizing are good because

it’s a simplistic model, and therefore its ability to

generalize is good, as we will see. MODERATOR: Also, in the example you

explain the use of binary function. So can you use more multi-valued

or real functions? PROFESSOR: Correct. Remember when I told you that there is

a topic that is out of sequence. There was a logical sequence to the

course, and then I took part of the linear models and put it very early on,

to give you something a little bit more sophisticated than perceptrons

to try your hand on. That happens to be for

real-valued functions. And obviously there are hypotheses that

cover all types of co-domains. Y could be anything as well. MODERATOR: Another question is, in

the learning process you showed, when do you pick your learning algorithm,

when do you pick your hypothesis set, and what liberty do you have? PROFESSOR: The hypothesis set

is the most important aspect of determining the generalization behavior

that we’ll talk about. The learning algorithm does play a role,

although it is a secondary role, as we will see in the discussion. So in general, the learning

algorithm has the form of minimizing an error function. So you can think of the

perceptron, what does the algorithm do? It tries to minimize the

classification error. That is your error function, and

you’re minimizing it using this particular update rule. And in other cases, we’ll see that we

are minimizing an error function. Now the minimization aspect is

an optimization question, and once you determine that this is indeed the

error function that I want to minimize, then you go and minimize

as much as you can using the most sophisticated optimization

technique that you find. So the question now translates into

what is the choice of the error function or error measure that

will help or not help. And that will be covered also next week

under the topic, Error and Noise. When I talk about error, we’ll talk

about error measures, and this translates directly to the learning

algorithm that goes with them. MODERATOR: Back to the perceptron. So what happens if your hypothesis

gives you exactly 0 in this case? PROFESSOR: So remember that

the quantity you compute and compare with the threshold

was your credit score. So I told you what happens if you are

above threshold, and what happens if you’re below threshold. So what happens if you’re exactly

at the threshold? Your score is exactly that. The informal answer is that it depends

on the mood of the credit officer on that day. If they had a bad day,

you will be denied! But the serious answer is that

there are technical ways of defining that point. You can define it as 0,

so the sign of 0 is 0. In which case you are always making

an error, because you are never +1 or -1, when you should be. Or you could make it belong

to the +1 category or to the -1 category. There are ramifications for

all of these decisions that are purely technical. Nothing conceptual comes out of them. That’s why I decided not

to include it. Because it clutters the main concept

with something that really has no ramification. As far as you’re concerned, the easiest

way to consider it is that the output will be 0, and therefore you will

be making an error regardless of whether it’s +1 or -1. MODERATOR: Is there a kind of problem

that cannot be learned even if there’s a huge amount of data? PROFESSOR: Correct. For example, if I go to my computer

and use a pseudo-random number generator to generate the target over

the entire domain, then patently, nothing I can give you will make

you learn the other guys. So remember the three– let me try to– the essence of machine learning. The first one was, a pattern exists. If there’s no pattern that exists,

there is nothing to learn. Let’s say that it’s like a baby,

and stuff is happening, and the baby is just staring. There is nothing

to pick from that thing. Once there is a pattern, you can see

the smile on the baby’s face. Now I can see what is going on. So whatever you are learning,

there needs to be a pattern. Now, how to tell that there’s

a pattern or not, that’s a different question. But the main ingredient, there’s a pattern.

The other one is we cannot pin it down mathematically. If we can pin it down mathematically,

and you decide to do the learning, then you

are really lazy. Because you could just write the code. But fine. You can use learning in this case, but

it’s not the recommended method, because it has certain errors

in performance. Whereas if you have the mathematical

definition, you just implement it and you’ll get the best possible solution. And the third one, you have data,

which is key. So you have plenty of data, but the

first one is off, you are simply not going to learn. And it’s not like I have to answer each

of these questions at random. The theory will completely

capture what is going on. So there’s a very good reason for going

through four lectures in the outline that are

mathematically inclined. This is not for the sake of math. I don’t like to do math

hacking, if you will. I pick the math that is necessary

to establish a concept. And these will establish it, and they

are very much worth being patient with and going through. Because once you’re done with them, you

basically have it cold about what are the components that make learning

possible, and how do we tell, and all of the questions that have been asked. MODERATOR: Historical question. So why is the perceptron often

related with a neuron? PROFESSOR: I will discuss this

in neural networks, but in general, when you take a neuron and synapses, and

you find what is the function that gets to the neuron, you find that the

neuron fires, which is +1, if the signal coming to it, which is roughly

a combination of the stimuli, exceeds a certain threshold. So that was the initial inspiration, and

the initial inspiration was that: the brain does a pretty good

job, so maybe if we mimic the function, we will get something good. But you mimic one neuron, and then you

put it together and you’ll get the neural network that you

are talking about. And I will discuss the analogy with

biology, and the extent that it can be benefited from, when we talk

about neural networks, because that will be the more proper

context for that. MODERATOR: Another question is,

regarding the hypothesis set, are there Bayesian hierarchical procedures

to narrow down the hypothesis set? PROFESSOR: OK. The choice of the hypothesis set and

the model in general is model selection, and there’s quite a bit of

stuff that we are going to talk about in model selection, when we

talk about validation. In general, the word Bayesian was

mentioned here– if you look at machine learning, there are

schools that deal with the subject differently. So for example, the Bayesian school

puts a mathematical framework completely on it. And then everything can be derived,

and that is based on Bayesian principles. I will talk about that at the very

end, so it’s last but not least. And I will make a very specific point

about it, for what it’s worth. But what I’m talking about in the course

in all of the details, are the most commonly useful methods

in practice. That is my criterion for inclusion. So I will get to that

when we get there. In terms of a hierarchy, there are a number of hierarchical

methods. For example, structural risk

minimization is one of them. There are methods of hierarchies,

and the ramifications of it in generalization. I may touch upon it, when I get

to support vector machines. But again, there’s a lot of theory,

and if you read a book on machine learning written by someone from pure

theory, you would think that you are reading about a completely

different subject. It’s respectable stuff, but

different from the other stuff that is practiced. So one of the things that I’m trying to

do, I’m trying to pick from all the components of machine learning, the

big picture that gives you the understanding of the concept, and

the tools to use it in practice. That is the criterion for inclusion. Any questions from the inside here? OK, we’ll call it a day, and

we’ll see you on Thursday.

Excellent Introductory Lesson Professor 🙂

Thank you for uploading it online 😀

Awesome videos !!!!

I wasn't able to understand the movie's example completely

What an Excellent Professor!, Caltech Please upload more courses.

This was a brilliant introduction to machine learning !

This professor is awesome and funny at the same time.

audio glitches for anyone else? cuts out momentarily quite often?

How many students are there in this class?

who on earth disliked and downvoted this video?..the professor's teaching from just the first lesson has inspired me to research into other ML ANNs beside just the perceptron like the Hopfield NN and gain a megatonne of info…from just the first damned lesson..and i was simply clicking on recommendations goddamit…had no clue about ML..those damned lucky students

Nicely taught professor.

What a great teacher!

shout out to the professor who gets into the material 3 minutes into the course

The professor explains everything so clearly! The video and audio have high quality! The questions asked by the students have high quality!

I don't encourage people to watch this lecture at 1.25x speed.

is this a bachelor or master lever course?

Seriously, this is the best lecture on #ML available on web. Brilliant explanation and the examples at the beginning (specially the Credit approval one) is well explained and fits exactly into introduction to ML.

(Don't read this comment.)

Stopped at 32:35. What are x, y, mathcal{H} in the movie ratings example? A guess:

x = (viewer ID, movie ID) in mathcal{R}^2

y = +1 or -1 depending on whether the rating is good or bad

h(x) = sign(w(viewer(x)) . t(movie(x)))

where

viewer(x) is the first coordinate of x, and movie(x) is the second coordinate of x;

w returns a viewer's vector, and t returns a movie's vector.

This was really an awesome session

+1

+1

Do Caltech have Optimal Control video?

The best lecture series I have seen so far.

Very simple, practical and understandable. Thanks

One of the best teachers I have ever seen! Congratulations, Professor Mustafa!

I think it is the best machine learning introduction I have watched so far

Is it feasable to watch this course in 2017?(As it was recorded in 2013)

Top Class Lecture. Great Professor. Brilliant.

33 people are misclassified.

Great lecture

This lecturer is incredible <3 I wish I could learn from them more directly, and will be looking out for opportunities to do so! Thank you for making this available 🙂

watching at 0.75x

非常飘准的口音

great courses!

This kind of learning resource is useful for providing professionals with the right intuition and conceptual framework in the technology of Machine Learning. More power to California Institute of Technology and Professor Yaser Abu-Mostafa.

Great content and taught in an Excellent manner

I like his pronunciation. I'm even able to listen to him without subtitle.

great teaching ever i seen

I couldnt understand the Perceptron correction part .May be because he chose to explain using vectors. This video got me confused about perceptrons.

can anyone give me his book of Machine Learning

excelent. good machine learning.

Great video.

He shouldn't use his hand to explain the perceptron, look at his hand I can't make sense the perceptron, why doesn't use a table with a chalk

求中文翻译。。。。

thnk u caltech n honourable professor for this noble work

Is random forest a hypothesis set or learning algorithm?

Best lectures of machine learning with clear voice 🙂 Thanks Professor

My Learning Algorithm predicts that the professor loves his tea.

Straight to the point with practical examples. After 10 minutes you already know what the course is about and if it is interesting to you. The best Lecture 1 of a course I have seen on any subject 🙂

Thanks Professor well explained

Is this bachelor's level, true?

Yaser, you are not just a scientist but also an amazing teacher — your students must be lucky to have you!

Thank you Professor Yaser Abu-Mostafa and to all you collogues and contributors. Most helpful and enjoyable. I'm impressed with Caltech.

Enlightening analogy on unsupervised learning (learning Portugal language). Great teacher!

I finally got the book and I'm glad I did.

perceptron algorithm is bad computationally and good in term of generalization.because it is simplistic algorithm.

good，the book is learn from data.

Brilliant, thank you

Why some people are walking during the lessons, coughing during the concerts, really what are their problem??

Awesome teacher, teaching from realization…everything he makes so easy…kind of idol to me, added with his smile and ascent …yes Sensational..

thanks sir,this lecture is very helpful for every beginers

god has a new name and he has research & teaching interests in machine learning- simply wow Professor Yaser Abu-Mostafa. thanks and best regards

I recommend 1.25 speed for these videos

i think one of the best introduction videos to machine learning! 🙂 thanks a lot

Fucking scumbag credit customer #2. Smh.

if this too slow for you. console > document.getElementsByTagName('video')[0].playbackRate = 1.2

Wow, I could not find any other resource that explained the intuition behind using the inner product. What a great teacher.

does anybody have this lecture notes and can upload ?

soo i watched it at 0.5x

just to be that one guy

this guy knows a lot about a lot, you can tell by the confidence with which he navigates the subject and how he is aware of the limits of the theory

How is the original w generated? I'm guessing also by the correction function w'=w+yixi?

The perceptron function hypothesis is actually a logistic regression, that guy makes it quite simple to understands the difference between the hypothesis/target function and the learning algorithm, the later is the true innovation for statisticians and econometricians.

teaching with passion..

Nothing beats ML lectures from Professor Gru (Despicable Me!!)

-1

l came here by chance and i surprised that this prof is Egyptian ,i am really proud of you sir you have a great style in teaching

That's wonderful man

Can anybody please help me with the prerequisites of this course? As I am finding a little difficult to cope-up with the course.

Any help would be highly appreciated………..Plz help.

I had to come to comments section and search the man behind this lecture.

A wonderful lecturer wish all teachers could teach like him

Notations at 17:50

The checkered board target function example is a great example of confirmation bias. The symmetry analysis probably feels "correct" to a lot of people because it "explains" the "most" information – and yet it doesn't.

Outstanding teacher. If I was in the audience I'd throw my panties at him.

+1

He sounds like a Russian Villain in those Hollywood movies.

kkkk…topzeira this professor.

Who is watching in 2019? Hit Like!

anyone watching in 2019

Excellent , Thank you prof

kids : normal

Men : 1.25x

Legends :1.5x

This is a teacher, I like the way explains the math and this confidence in his voice.

48:05 It's some kind of revelation. I'm enjoying this lecture so much! Thank you!

And Yaser is so intelligent and pleasant person.

You are awesome Sir !! thank you

The lecturer makes college courses like porns, damn!

YEY!!! Thank you very much for such an amazing course!!!

nice way of explaining things.

My CAT Decided What I ATE for 24 HOURS (And This Is What Happended…)

https://chrome.google.com/webstore/detail/threelly-ai-for-youtube/dfohlnjmjiipcppekkbhbabjbnikkibo

thanks prof. Yaser, you are gem of a professor, is there any other lecture series by you on some other topics? exemplary clarity of thought!

Prof. Yaser is a great Teacher. I'm seeing this for the first time (October 2019). Still the best.

Thank you for this great lecture.

Egyptian in the top💙💙2019

这个课程大纲和林轩田的好像啊，课程图标也很像…