
Machine Learning Questions

How do we measure the accuracy of a hypothesis function?

By using a cost function, usually denoted by J.



What is the definition of a cost function of a supervised learning problem?

It takes an average of the differences (more precisely, the squared differences) between all the results of the hypothesis on the input x's and the actual outputs y's.




What are alternative terms for the cost function? #2

• Squared error function.


• Mean squared error.



State the algorithm for gradient descent.

Repeat until convergence: θj := θj − α · ∂/∂θj J(θ0, θ1), where j = 0, 1 represents the feature index number and both parameters are updated simultaneously.



How does gradient descent converge with a fixed step size alpha? #2

• As we approach a local minimum, gradient descent will automatically take smaller steps, because the derivative term gets smaller.


• Thus no need to decrease alpha over time.



What is the algorithm for implementing gradient descent for linear regression? #2

• We can substitute our actual cost function and our actual hypothesis function.


• m is the size of the training set, theta 0 is a constant that is updated simultaneously with theta 1, and x, y are the values of the given training set (data); see the sketch below.
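A minimal Octave sketch of one such update for univariate linear regression, assuming x and y are m x 1 column vectors and alpha is the learning rate:

m = length(y);
h = theta0 + theta1 * x;                           % predictions for all m examples
temp0 = theta0 - alpha * (1 / m) * sum(h - y);
temp1 = theta1 - alpha * (1 / m) * sum((h - y) .* x);
theta0 = temp0;                                    % simultaneous update
theta1 = temp1;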



Give the derivation for a single example in batch gradient descent! (Gradient Descent For Linear Regression)

For a single training example, the chain rule gives ∂/∂θj (1/2)(hθ(x) − y)² = (hθ(x) − y) · xj.



What is batch gradient descent? #2 (Gradient Descent For Linear Regression)

• Gradient descent on the original cost function J.


•This method looks at every example in the entire training set on every step.



How does batch gradient descent differ from gradient descent? (Gradient Descent For Linear Regression)

While gradient descent can be susceptible to local minima in general, the linear regression cost function has only one global optimum and no other local optima, so batch gradient descent always converges to it (assuming the learning rate alpha is not too large).



Depict an example of gradient descent as it is run to minimize a quadratic function. #2

• Shown is the trajectory taken by gradient descent, which was initialized at (48, 30).


• The x's in the figure (joined by straight lines) mark the successive values of theta that gradient descent went through as it converged to its minimum.



What is multivariate linear regression?

Linear regression with multiple variables.



What is the notation for equations where we can have any number of input variables? (Multivariate Linear Regression)

Notation:

• x(i) = the input (features) of the i-th training example.

• x(i)j = the value of feature j in the i-th training example.

• m = the number of training examples.

• n = the number of features.






What is the multivariate form of a hypothesis function?

The multivariate form of the hypothesis function is hθ(x) = θ0 + θ1·x1 + θ2·x2 + ... + θn·xn.



What is the intuition of the multivariable form of a hypothesis function in the example of estimating housing prices? #2

• We can think about theta 0 as the basic price of a house, theta 1 as the price per square meter, theta 2 as the price per floor, etc.


• x1 will be the number of square meters in the house, x2 the number of floors, etc.



Give the vectorization of the multivariable form of a hypothesis function.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as hθ(x) = θᵀx (with x0 = 1).
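As a concrete illustration, a minimal Octave sketch assuming a design matrix X with one training example per row and a leading column of ones:

predictions = X * theta;   % computes h for every training example at once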



Why do we assume that x0=1 in multivariate linear regression?

By convention, so that the parameter vector θ and the feature vector x have the same number of elements (n + 1) and the hypothesis can be written as the matrix product θᵀx.



What is the Gradient Descent for Multiple Variables? #2

• The gradient descent equation itself is generally the same form.


• We just have to repeat it for our 'n' features; see the vectorized sketch below.
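A minimal vectorized Octave sketch of one update, assuming X is the m x (n+1) design matrix, y the m x 1 targets, theta the (n+1) x 1 parameter vector, and alpha the learning rate:

theta = theta - (alpha / m) * X' * (X * theta - y);   % updates every theta_j simultaneously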



How can we speed up gradient descent?

We can speed up gradient descent by having each of our input values in roughly the same range.



Why does feature scaling speed up gradient descent? #2

• This is because theta will descend quickly on small ranges and slowly on large ranges.


• Thus it will oscillate inefficiently down to the optimum when the variables are very uneven.



What are the ideal ranges of our input variables in gradient descent? #2

• For example, a range between −1 and 1.


• These aren't exact requirements; we are only trying to speed things up.



What is feature scaling? #2

• Involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable.


• Results in a new range of just 1.



What is mean normalization? #2

• Involves subtracting the average value for an input variable from the values for that input variable.


• Results in a new average value for the input variable of just zero.






How do you implement both feature scaling and mean normalization? #2

Apply both at once: xi := (xi − μi) / si, where μi is the average of all the values for feature i and si is the range of values (max − min) or the standard deviation.
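A minimal Octave sketch, assuming X holds the (non-bias) features one per column; the range is used for si here, though the standard deviation also works:

mu = mean(X);              % per-feature averages
s = max(X) - min(X);       % per-feature ranges
X_norm = (X - mu) ./ s;    % mean-normalized and scaled features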



How can you debug gradient descent? #3

• Make a plot with number of iterations on the x-axis.


• Now plot the cost function J of theta over the number of iterations of gradient descent.


• If J of theta ever increases, then you probably need to decrease alpha.



How can the step parameter alpha in gradient descent cause bugs? #2

• If alpha is too small: slow convergence.


• If alpha is too large: may not decrease on every iteration and thus may not converge.



What is the Automatic convergence test in gradient descent? #2

• Declare convergence if J of theta decreases by less than E in one iteration, where E is some small value such as 0.001.


• However in practice it's difficult to choose this threshold value.



How can we improve our features? (Multivariate Linear Regression) #2

• We can combine multiple features into one.


• For example, we can combine x1 and x2 into a new feature x3 by taking x1 times x2.



How can we improve the form of our hypothesis function? (Multivariate Linear Regression)

By making it a quadratic, cubic or square root function (or any other form).



What important thing should one keep in mind if one changes the form of a hypothesis function? (Multivariate Linear Regression) #2

• If you create new features when doing polynomial regression then feature scaling becomes very important.


• For example, if x has range 1 - 1000 then range of x^2 becomes 1 - 1000000.



State the normal equation formula!

The normal equation is θ = (XᵀX)⁻¹ Xᵀy.
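In Octave this is a one-liner (using pinv, as these cards recommend later):

theta = pinv(X' * X) * X' * y;   % X: m x (n+1) design matrix, y: m x 1 target vector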



Compare gradient descent and the normal equation!

The following is a comparison of gradient descent and the normal equation:

• Gradient descent: you need to choose alpha; it needs many iterations; it costs O(k·n²); it works well even when n is large.

• Normal equation: no need to choose alpha; no iterations; it costs O(n³) to compute (XᵀX)⁻¹; it is slow if n is very large.



Does feature scaling speed up the implementation of the normal equation?

There is no need to do feature scaling with the normal equation.






What is the complexity of computing the inversion with the normal equation?

With the normal equation, computing the inversion has complexity of n cubed.



When might it be a good time to go from a normal solution to an iterative process?

When the number of features n exceeds roughly 10,000, because the O(n³) cost of the normal equation becomes prohibitive.



Which function do we want to use in octave when implementing the normal equation? #2

• Use the 'pinv' function rather than 'inv'.


• The 'pinv' function will give you a value of theta even if X Transpose X is not invertible.



What are common causes for X Transpose X to be noninvertible? #2

• Redundant features, where two features are very closely related (i.e. they are linearly dependent).


• Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization".



How do we change the form of our binary hypothesis function to be continuous in the range between 0 and 1?

By using the Sigmoid Function, also called the Logistic Function.
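A minimal Octave sketch of the sigmoid (logistic) function, g(z) = 1 / (1 + e^(−z)):

function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));   % element-wise, so z may be a scalar, vector, or matrix
end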



How can we interpret the output of our logistic function?

hθ(x) for a given input x gives us the probability that our output is 1.



How can we get our discrete 0 or 1 classification from a logistic function?

We can translate the output of the hypothesis function as follows: if hθ(x) ≥ 0.5, predict y = 1; if hθ(x) < 0.5, predict y = 0.



What is the decision boundary given a logistic function? #2

• The decision boundary is the line that separates the area where y = 0 and where y = 1.


• It is created by our hypothesis function.



How does the cost function for a logistic regression look like? #2

• We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima.


• In other words, it will not be a convex function.



Plot the cost function J, if the correct answer for y is 1.

• If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs 1.


• If our hypothesis approaches 0, then the cost function will approach infinity.






Plot the cost function, if the correct answer for y is 0. #2

• If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also outputs 0.


• If our hypothesis approaches 1, then the cost function will approach infinity.



How can we simplify our cost function? (Logistic Regression Model)

We can compress our cost function's two conditional cases into one case: Cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x)).



Give the vectorized implementation of our simplified cost function! (Logistic Regression Model)

A vectorized implementation is:
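In Octave, the standard vectorized form is (assuming a sigmoid() helper like the one sketched earlier):

h = sigmoid(X * theta);                                % m x 1 vector of predictions
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % scalar logistic cost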



Give the vectorized implementation for Gradient Descent! (Logistic Regression Model)

A vectorized implementation is:
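In Octave, the standard vectorized update is:

theta = theta - (alpha / m) * X' * (sigmoid(X * theta) - y);   % simultaneous update of all theta_j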



What is gradient descent for our simplified cost function? (Logistic Regression Model) #2

• Notice that this algorithm is identical to the one we used in linear regression.


• We still have to simultaneously update all values in theta.



Depict an example of One-versus-all to classify 3 classes! (Multiclass Classification)

The following image shows how one could classify 3 classes:



What is the implementation of One-versus-all in Multiclass Classification? #2

• Train a logistic regression classifier h of theta for each class to predict the probability that y = i .


• To make a prediction on a new x, pick the class that maximizes h of theta.



What is underfitting?

Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data.



What usually causes underfitting?

It is usually caused by a function that is too simple or uses too few features.



What is overfitting?

Overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data.






What usually causes overfitting?

It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.



What are the two main options to address the issue of overfitting? #4

Reduce the number of features:

• Manually select which features to keep.

• Use a model selection algorithm.


Regularization:

• Keep all the features, but reduce the magnitude of parameters theta j.

• Regularization works well when we have a lot of slightly useful features.



In a basic sense, what are neurons?

Neurons are basically computational units that take inputs (dendrites) as electrical inputs (called "spikes") that are channeled to outputs (axons).



What are the dendrites in the model of neural networks?

In our model, our dendrites are like the input features.



What are the axons in the model of neural networks?

In our model, the axons are the results of our hypothesis function.



What is the bias unit of a neural network? #2

• The input node x0 is sometimes called the "bias unit."


• It is always equal to 1.



What are the weights of a neural network?

Using the logistic function, our "theta" parameters are sometimes called "weights".



What is the activation function of a neural network?

The logistic function (as in classification) is also called a sigmoid (logistic) activation function.



How do we label the hidden layers of a neural network? #2

• We label these intermediate or "hidden" layer nodes a(2)0 ... a(2)n.


• The nodes are also called activation units.



How do we determine the dimension of the matrices of weights? (Neural Network) #2

• If a network has s(j) units in layer j and s(j+1) units in layer j+1, then Θ(j) will be of dimension s(j+1) × (s(j) + 1); the +1 comes from the addition of the "bias nodes".


• In other words the output nodes will not include the bias nodes while the inputs will.



How do we obtain the values for each of the activation nodes, given a single-layer neural network with 3 activation nodes and a 4-dimensional input? #2

• We apply each row of the parameters to our inputs to obtain the value for one activation node.


• Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by the parameter matrix theta 2.



Give an example of the implementation of the OR-function as a neural network!

The following is an example of the logical operator 'OR', meaning either x1 is true or x2 is true, or both:



Give an example of the implementation of the AND-function as a neural network!

The following is an example of the logical operator AND, meaning it is only true if both x1 and x2 are 1.



What are the theta-matrices for implementing the logical functions 'AND', 'NOR', and 'OR' as a neural network?

One set of theta matrices (from the course examples): AND: Θ = [−30 20 20]; NOR: Θ = [10 −20 −20]; OR: Θ = [−10 20 20].



How can we implement the 'XNOR' operator with a neural network?

We can implement the 'XNOR' operator by using two hidden layers.



Give an example of a neural network which classifies data into one of four categories!

The inner layers each provide us with some new information, which leads to our final hypothesis function.



State the cost function for neural networks. #3

• The double sum simply adds up the logistic regression costs calculated for each cell in the output layer.


• The triple sum simply adds up the squares of all the individual Θs in the entire network.


• The i in the triple sum does not refer to training example i.



How are the variables L, s of l and K in the cost function of a neural network defined? #3

• L = total number of layers in the network.


• s of l = number of units (not counting bias unit) in layer l.


• K = number of output units.



Define Backpropagation for neural networks.

Backpropagation is neural-network terminology for minimizing our cost function.



What does the matrix Delta in the Back propagation algorithm do?

The capital-delta matrix Δ is used as an "accumulator" to add up our values as we go along and eventually compute our partial derivatives.



State the Backpropagation algorithm.

Backpropagation algorithm.



Which method randomly initializes our weights for our Theta matrices of a neural network?

We initialize each Θ(l)i,j to a random value between −ε and ε.
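A minimal Octave sketch, where s1 and s2 (the sizes of two adjacent layers) and the small constant INIT_EPSILON are assumed values:

INIT_EPSILON = 0.12;                                            % hypothetical small constant
Theta1 = rand(s2, s1 + 1) * 2 * INIT_EPSILON - INIT_EPSILON;    % entries in (-epsilon, epsilon)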



Give the setup of using a neural network. #4

• Pick a network architecture: choose the layout of your neural network.


• Number of input units: the dimension of the features x(i).


• Number of output units: the number of classes.


• Number of hidden units per layer: usually, the more the better.



How does one train a neural network? #6

1. Randomly initialize the weights.


2. Implement forward propagation.


3. Implement the cost function.


4. Implement backpropagation.


5. Use gradient checking to confirm that your backpropagation works.


6. Use gradient descent to minimize the cost function with the weights in theta.



What code is implemented if we perform forward and back propagation?

When we perform forward and back propagation, we loop on every training example.
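A minimal Octave sketch of that loop for a single-hidden-layer network, assuming X (m x n features), Y (m x K one-hot labels), weight matrices Theta1 and Theta2, accumulators Delta1 and Delta2 initialized to zeros, and a sigmoid() helper:

for i = 1:m
  % forward propagation on training example i
  a1 = [1; X(i, :)'];                 % add bias unit
  z2 = Theta1 * a1;
  a2 = [1; sigmoid(z2)];
  a3 = sigmoid(Theta2 * a2);          % hypothesis output
  % backpropagation on training example i
  delta3 = a3 - Y(i, :)';             % output-layer error
  delta2 = (Theta2' * delta3)(2:end) .* sigmoid(z2) .* (1 - sigmoid(z2));
  Delta1 = Delta1 + delta2 * a1';     % accumulate gradients
  Delta2 = Delta2 + delta3 * a2';
end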



How can we break down our decision process deciding what to do next? #6

• Getting more training examples: Fixes high variance.


• Trying smaller sets of features: Fixes high variance.


• Adding features: Fixes high bias.


• Adding polynomial features: Fixes high bias.


• Decreasing lambda: Fixes high bias.


• Increasing lambda: Fixes high variance.



What issue does a neural network with fewer parameters pose?

A neural network with fewer parameters is prone to underfitting.



What issue does a large neural network with more parameters pose?

A large neural network with more parameters is prone to overfitting.



How can you address the overfitting of a large neural network?

In this case you can use regularization (increase λ) to address the overfitting.



What bias-variance tradeoff do lower-order polynomials (low model complexity) have?

Lower-order polynomials (low model complexity) have high bias and low variance.



What is the issue with higher-order polynomials in regard to fitting the training data and test data?

Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly.



What bias-variance tradeoff do higher-order polynomials (high model complexity) have?

Higher-order polynomials (high model complexity) have low bias on the training data, but very high variance.



Why does training an algorithm on very few data points easily achieve 0 error?

Because we can always find, for example, a quadratic curve that passes exactly through that small number of points.



How does one experience high bias with a low training set size?

Low training set size: causes J train of theta to be low and J CV of theta to be high.



How does one experience high bias with a large training set size?

Large training set size: causes both J train of theta and J CV of theta to be high with J train approximately equal to J CV.



What approach will not generally help much by itself, when a learning algorithm is suffering from high bias?

If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.



How does one experience high variance with a low training set size?

Low training set size: J train of theta will be low and J CV of theta will be high.



How does one experience high variance with a large training set size?

Large training set size: J train of theta increases with training set size and J CV of theta continues to decrease without leveling off.



Under which circumstances will getting more training data help a learning algorithm to perform better?

If a learning algorithm is suffering from high variance, getting more training data is likely to help.



What is the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis? #2

• High bias (underfitting): both J train(Θ) and J CV(Θ) will be high. Also, J CV(Θ) is approximately equal to J train(Θ).


• High variance (overfitting): J train(Θ) will be low and J CV(Θ) will be much greater than J train(Θ).



Which 3 separate error values can we calculate, if we break down our dataset as such:

• Training set: 60%.

• Cross validation set: 20%.

• Test set: 20%.

1. Optimize the parameters in Theta (Θ) using the training set for each polynomial degree.


2. Find the polynomial degree d with the least error using the cross validation set.


3. Estimate the generalization error using the test set with Jtest(Θ(d)), where d is the polynomial degree with the lowest cross validation error.



Which function returns the values for jVal and gradient in a single turn?

function [jVal, gradient] = costFunction(theta)

jVal = [...code to compute J(theta)...];

gradient = [...code to compute derivative of J(theta)...];

end



Given costFunction(), what do we have to do to implement fminunc()?

We can use Octave's "fminunc()" optimization algorithm along with the "optimset()" function, which creates an object containing the options we want to send to "fminunc()".
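A usage sketch (the specific option values and the 2-element initial theta are assumptions):

options = optimset('GradObj', 'on', 'MaxIter', 100);   % tell fminunc that costFunction also returns the gradient
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);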



How can we approach regularization using the alternate method of the non-iterative normal equation?

To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:
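Concretely, the regularized normal equation is θ = (XᵀX + λ·L)⁻¹ Xᵀy, where L is the (n+1)×(n+1) identity matrix with its top-left entry set to 0 (so θ0 is not regularized). A minimal Octave sketch:

L = eye(n + 1);
L(1, 1) = 0;                                   % do not regularize the bias term theta_0
theta = pinv(X' * X + lambda * L) * X' * y;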



Given a training set and a test set, what is the new procedure for evaluating a hypothesis?

The new procedure using these two sets is then:


1. Learn Θ and minimize Jtrain(Θ) using the training set

2. Compute the test set error Jtest(Θ)



What is the test set error for linear regression?

For linear regression: Jtest(Θ) = (1 / (2·m_test)) · Σ (hΘ(x(i)test) − y(i)test)², summed over the m_test test examples.



What is the test set error for linear classification?

For classification ~ Misclassification error (aka 0/1 misclassification error): err(hΘ(x), y) = 1 if hΘ(x) ≥ 0.5 and y = 0, or hΘ(x) < 0.5 and y = 1; otherwise err = 0.



What is the average test error for the test set?

The average test error for the test set is Test Error = (1 / m_test) · Σ err(hΘ(x(i)test), y(i)test).


This gives us the proportion of the test data that was misclassified.
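A minimal Octave sketch, assuming predictions and ytest are m_test x 1 vectors of 0/1 labels:

test_error = mean(double(predictions ~= ytest));   % fraction of test examples misclassified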



Just because a learning algorithm fits a training set well, that does not mean [...]

Just because a learning algorithm fits a training set well, that does not mean it is a good hypothesis.



The error of your hypothesis as measured on the data set with which you trained the parameters will be [...]

The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than any other data set.



Model Selection without the validation set?

1. Optimize the parameters in Θ using the training set for each polynomial degree.


2. Find the polynomial degree d with the least error using the test set.


3. Estimate the generalization error, also using the test set, with Jtest(Θ(d)), where d is the polynomial degree with the lowest error.



What is the consequence of selecting a model without the validation set?

In this case, we have trained one variable, d, or the degree of the polynomial, using the test set. This will cause our error value to be greater for any other set of data.



How can we solve the problem of selecting a model with only the training set?

To solve this, we can introduce a third set, the Cross Validation Set, to serve as an intermediate set that we can train d with.


Then our test set will give us an accurate, non-optimistic error.



Model Selection with the validation set?

1. Optimize the parameters in Θ using the training set for each polynomial degree.


2. Find the polynomial degree d with the least error using the cross validation set.


3. Estimate the generalization error using the test set with Jtest(Θ(d)), where d is the polynomial degree with the lowest cross validation error.



What are the benefits of selecting a model with the validation set?

This way, the degree of the polynomial d has not been trained using the test set.


Note: be aware that using the CV set to select 'd' means that we cannot also use it for the validation curve process of setting the lambda value.



High bias (underfitting):

both Jtrain(Θ) and JCV(Θ) will be high. Also, JCV(Θ)≈Jtrain(Θ)



High variance (overfitting):

Jtrain(Θ) will be low and JCV(Θ) will be much greater than Jtrain(Θ).



A large lambda [...], which [...].

A large lambda heavily penalizes all the Θ parameters, which greatly simplifies the line of our resulting function, so causes underfitting.



The relationship of λ to the training set and the variance set is as follows:


Low λ:

Jtrain(Θ) is low and JCV(Θ) is high (high variance/overfitting).



The relationship of λ to the training set and the variance set is as follows:


Intermediate λ:

Jtrain(Θ) and JCV(Θ) are somewhat low and Jtrain(Θ)≈JCV(Θ).



The relationship of λ to the training set and the variance set is as follows:


Large λ

both Jtrain(Θ) and JCV(Θ) will be high (underfitting /high bias)



What do we need in order to choose the model and the regularization λ?

1. Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});

2. Create a set of models with different degrees or any other variants.


3. Iterate through the λs and for each λ go through all the models to learn some Θ.


4. Compute the cross validation error using the learned Θ (computed with λ) on the JCV(Θ) without regularization or λ = 0.


5. Select the best combo that produces the lowest error on the cross validation set.


6. Using the best combo Θ and λ, apply it on Jtest(Θ) to see if it has a good generalization of the problem.



As the training set gets larger, the error [...] increases.

As the training set gets larger, the error for a quadratic function increases.



The error value will [...] after a certain m, or training set size.

The error value will plateau out after a certain m, or training set size.



With high bias


Low training set size: [...]

Low training set size: causes Jtrain(Θ) to be low and JCV(Θ) to be high.



With high bias


Large training set size: [...]

Large training set size: causes both Jtrain(Θ) and JCV(Θ) to be high with Jtrain(Θ)≈JCV(Θ).



With high variance


Low training set size: [...]

Low training set size: Jtrain(Θ) will be low and JCV(Θ) will be high.



With high variance


Large training set size:

Large training set size: Jtrain(Θ) increases with training set size and JCV(Θ) continues to decrease without leveling off.

Also, Jtrain(Θ) < JCV(Θ), but the difference between them remains significant.



How can we tell which parameters Θ to leave in the model (known as "model selection")?

There are several ways to solve this problem:


- Get more data (very difficult).

- Choose the model which best fits the data without overfitting (very difficult).

- Reduce the opportunity for overfitting through regularization.



Bias: approximation error (Difference between expected value and optimal value)

- High Bias = UnderFitting (BU)

- Jtrain(Θ) and JCV(Θ) both will be high and Jtrain(Θ) ≈ JCV(Θ)



Variance: estimation error due to finite data

- High Variance = OverFitting (VO)

- Jtrain(Θ) is low and JCV(Θ) ≫Jtrain(Θ)



Intuition for the bias-variance trade-off:


Complex model => [...]

Complex model => sensitive to data => much affected by changes in X => high variance, low bias.



Intuition for the bias-variance trade-off:


Simple model => [...]

Simple model => more rigid => does not change as much with changes in X => low variance, high bias.



Regularization Effects?

- Small values of λ allow model to become finely tuned to noise leading to large variance => overfitting.


- Large values of λ pull weight parameters to zero leading to large bias => underfitting



Model Complexity Effects?

Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently.


Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.


In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.



A typical rule of thumb when running diagnostics is?

More training examples fixes high variance but not high bias.


Fewer features fixes high variance but not high bias.


Additional features fixes high bias but not high variance.


The addition of polynomial and interaction features fixes high bias but not high variance.


When using gradient descent, decreasing lambda can fix high bias and increasing lambda can fix high variance (lambda is the regularization parameter).


When using neural networks, small neural networks are more prone to under-fitting and big neural networks are prone to over-fitting. Cross-validation of network size is a way to choose alternatives.



Different ways we can approach a machine learning problem?

Collect lots of data (for example, the "honeypot" project, though it doesn't always work).


Develop sophisticated features (for example: using email header data in spam emails)


Develop algorithms to process your input in different ways (recognizing misspellings in spam).



The recommended approach to solving machine learning problems is?

Start with a simple algorithm, implement it quickly, and test it early.


Plot learning curves to decide if more data, more features, etc. will help


Error analysis: manually examine the errors on examples in the cross validation set and try to spot a trend.



It's important to get error results as [...]. Otherwise it is difficult to [...]

It's important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm's performance



It is sometimes difficult to tell whether a reduction in error is actually an improvement of the algorithm. When does this usually happen?

This usually happens with skewed classes; that is, when our class is very rare in the entire data set.


Or to say it another way, when we have a lot more examples from one class than from the other class.



Predicted: 1, Actual: 1

Predicted: 1, Actual: 1 --- True positive



Predicted: 0, Actual: 0

Predicted: 0, Actual: 0 --- True negative



Predicted: 0, Actual: 1

Predicted: 0, Actual: 1 --- False negative



Predicted: 1, Actual: 0

Predicted: 1, Actual: 0 --- False positive



Precision: of all patients we predicted where y=1, [...]

Precision: of all patients we predicted where y=1, what fraction actually has cancer?



Recall: Of all the patients that actually have cancer, what [...]

Recall: Of all the patients that actually have cancer, what fraction did we correctly detect as having cancer?
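In terms of the true/false positive and negative counts from the cards above:

Precision = True Positives / (True Positives + False Positives)

Recall = True Positives / (True Positives + False Negatives)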



When is the F1 score not defined?

If an algorithm predicts only negatives, as it does in one of the exercises, precision is not defined because it would require dividing by 0 (there are no predicted positives), and so the F1 score is not defined either.



We might want a confident prediction of two classes using logistic regression. One way is to [...]

We might want a confident prediction of two classes using logistic regression. One way is to increase our threshold.


This way, we only predict cancer if the patient has a 70% chance.



What tradeoff will we have if we want a more confident prediction of two classes using logistic regression?

Doing this, we will have higher precision but lower recall (refer to the definitions in the previous section).



What is the result if we want to get a very safe prediction?

This will cause higher recall but lower precision.



Trading Off Precision and Recall


The greater the threshold, the [...]

The greater the threshold, the greater the precision and the lower the recall.



Trading Off Precision and Recall


The lower the threshold, the [...]

The lower the threshold, the greater the recall and the lower the precision.



Trading Off Precision and Recall


In order to turn these two metrics into one single number, we can [...]

In order to turn these two metrics into one single number, we can take the F1 score.
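The F1 score is F1 = 2 · (Precision · Recall) / (Precision + Recall).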



In order for the F Score to be large, [...]

In order for the F Score to be large, both precision and recall must be large.



We want to train precision and recall on the [...]

We want to train precision and recall on the cross validation set so as not to bias our test set



In certain cases, an "inferior algorithm," if given [...], can [...]

In certain cases, an "inferior algorithm," if given enough data, can outperform a superior algorithm with less data.



We must choose our features to have [...].


A useful test is: [...]

We must choose our features to have enough information.


A useful test is: Given input x, would a human expert be able to confidently predict y?



What is the "Rationale for large data"?

Rationale for large data: if we have a low bias algorithm (many features or hidden units making a very complex function), then the larger the training set we use, the less we will have overfitting (and the more accurate the algorithm will be on the test set).
