
Week 4 Solution PDS

Uploaded by

Netaji Gandi

NPTEL-PYTHON FOR DATA SCIENCE

ASSIGNMENT-4-SOLUTION

1. Answer: B: pandas.get_dummies()
• This function creates dummy (indicator) columns for each categorical variable. Each
category is added as a new column in the dataframe.
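
A minimal sketch of this encoding, assuming a small made-up dataframe (the column names "department" and "salary" and their values are illustrative, not the assignment data):

```python
import pandas as pd

# Hypothetical toy data; names and values are assumptions for illustration.
df = pd.DataFrame({
    "department": ["sales", "hr", "sales", "it"],
    "salary": ["low", "high", "medium", "low"],
})

# Each category of each listed column becomes its own indicator column.
encoded = pd.get_dummies(df, columns=["department", "salary"])
print(encoded.columns.tolist())
```

The two original columns are replaced by one column per category, named with the pattern column_category.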

2. Answer: D: Three key benefits of performing feature selection on your data are:
• Reduces Overfitting: Less redundant data means fewer errors due to noise
• Improves Accuracy: Removing redundant data improves accuracy
• Reduces Training Time: Less data means that algorithms train faster

3. Answer: C: sklearn.model_selection.train_test_split()
• The dataset is usually split into training data and test data. The model learns from
the training data. We use the test data to check the model’s predictions.
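
A short sketch of such a split, assuming a made-up array of 10 samples and a 20% test share (both are illustrative choices, not values from the assignment):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```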
4. Answer: B
• k is the number of nearest neighbours used to predict the class

5. Answer: C: sklearn.neighbors.KNeighborsClassifier()
• The sklearn library provides a layer of abstraction on top of Python
• Therefore, to use the KNN algorithm, it is sufficient to create an
instance of KNeighborsClassifier.
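
As a sketch of how this could look, assuming two made-up, well-separated clusters and k = 3 (the data and the value of k are illustrative only):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: three points near (0, 0) and three near (5, 5).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])

# n_neighbors is k, the number of nearest neighbours that vote on the class.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Each new point is assigned the majority class among its 3 nearest neighbours.
pred = knn.predict([[0.1, 0.1], [5.0, 5.1]])
print(pred)
```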

6. Answer: A
Plotting the standardized residuals of a model against the predicted values
gives a residual plot. When the variance of the residuals is not constant
across the predicted values, this is called heteroscedasticity.
7. Answer: B
R-squared is the proportion of the response variable variation that is explained by
a linear model. R-squared is always between 0 and 1 where:
o 0 indicates that the model explains none of the variability of the response
variable.
o 1 indicates that the model explains all the variability of the response
variable.
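
A small sketch of how R-squared can be computed, assuming made-up data and an ordinary least squares fit with an intercept (the numbers are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical data that is roughly linear in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

# R-squared = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual value should agree with sklearn's r2_score.
print(r2_manual, r2_score(y, y_hat))
```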
8. Answer: A
• The numbers of correct and incorrect predictions are summarized as counts
• The number of participants that have been wrongly classified as female is 15
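
A sketch of how such counts could be read off, assuming the summary in question is a confusion matrix and using made-up labels (these are not the data behind the count of 15):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels.
y_true = ["male", "male", "female", "female", "male", "female"]
y_pred = ["male", "female", "female", "male", "female", "female"]

# Rows are actual classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["female", "male"])

# Actual male (row 1) predicted as female (column 0): wrongly classified as female.
wrongly_female = cm[1, 0]
print(cm)
print(wrongly_female)
```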
9. Answer:D
• The Akaike information criterion (AIC) is an estimator of the relative quality of
statistical models for a given set of data
• Thus, AIC provides a means for model selection
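
For reference, the usual definition of AIC, where k (the number of estimated parameters) and \hat{L} (the maximized value of the likelihood) are symbols introduced here and not used in the original answer:

```latex
\mathrm{AIC} = 2k - 2\ln(\hat{L})
```

Among candidate models fitted to the same data, the one with the lower AIC is preferred.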
10. Answer: D
• Maximum likelihood will provide values of β0 and β1 which maximize the
probability of the occurrence of the dependent variable
• We use the log-likelihood function to estimate the probability of observing the
dependent variable, given the unknown parameters (β0 and β1)
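
As a sketch, assuming a binary logistic regression with a single predictor x_i and a 0/1 response y_i (this setup and the symbols x_i, y_i, p_i, n are assumptions, not stated in the answer), the log-likelihood being maximized can be written as:

```latex
p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}, \qquad
\ell(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right]
```

The maximum likelihood estimates are the values of β0 and β1 that maximize this sum.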
11. Answer: A

• The Gini index ranges between 0 and 1, where 0 denotes that all
elements belong to one class and values approaching 1 denote that the elements are
randomly distributed across many classes
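
A minimal sketch of the computation, assuming the index meant here is the Gini impurity, 1 minus the sum of squared class proportions (the helper name gini_impurity and the labels are hypothetical):

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_c ** 2) over the class proportions p_c."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure = gini_impurity(["a", "a", "a", "a"])   # all elements in one class
mixed = gini_impurity(["a", "b", "a", "b"])  # two equally frequent classes
print(pure, mixed)
```

With k equally frequent classes the value is 1 - 1/k, so it is 0.5 for two classes and moves towards 1 only as the number of classes grows.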
Use the following code to import your data and then proceed
with the questions:

12. INPUT

OUTPUT

INFERENCE: Answer: D
None of the variables in the data has missing values.
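
A minimal sketch of one way to make such a check, using a small hypothetical dataframe rather than the assignment data (only the column names are taken from the answers in this document):

```python
import pandas as pd

# Hypothetical stand-in for the assignment data.
data = pd.DataFrame({
    "lastEvaluation": [0.53, 0.86, 0.88],
    "numberOfProjects": [2, 5, 7],
})

# Count the missing values in each column; all zeros means nothing is missing.
missing_per_column = data.isnull().sum()
print(missing_per_column)
```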
13. INPUT:
OUTPUT:

INFERENCE: Answer: B
The third quartile for the variable “lastEvaluation” is 0.87.
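
A sketch of how a third quartile can be read off in pandas, using made-up values (so the result here is not the 0.87 reported for the assignment data):

```python
import pandas as pd

# Hypothetical values standing in for the "lastEvaluation" column.
s = pd.Series([0.40, 0.50, 0.60, 0.70, 0.80], name="lastEvaluation")

# The third quartile is the 75th percentile.
q3 = s.quantile(0.75)

# describe() reports the same value under the "75%" label.
summary = s.describe()
print(q3, summary["75%"])
```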
14. INPUT:

OUTPUT:

INFERENCE: Answer: C
The “SALES” department has the highest frequency in the low salary category.
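
A sketch of one way to build such a frequency table, assuming a two-way table made with pandas.crosstab on hypothetical data (the column names "dept" and "salary" and all values are illustrative):

```python
import pandas as pd

# Hypothetical department and salary categories.
data = pd.DataFrame({
    "dept": ["sales", "sales", "hr", "it", "sales", "hr"],
    "salary": ["low", "low", "low", "high", "medium", "medium"],
})

# Rows are departments, columns are salary categories, cells are counts.
table = pd.crosstab(data["dept"], data["salary"])

# Department with the highest count in the "low" salary column.
top_low = table["low"].idxmax()
print(table)
print(top_low)
```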
15. INPUT:

OUTPUT:

INFERENCE: Answer: B
From the above plot we can see that the median value of “numberOfProjects”, the number of
projects the employees have worked on, is 4.
16. & 17: INPUT:
OUTPUT:

INFERENCE: Answer for Q16: A and answer for Q17: D

The accuracy of our model is 80% and the number of misclassified samples is 745.
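
A sketch of how the two quantities relate, using made-up test labels and predictions (not the assignment's model or data):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions for 10 test samples.
y_test = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# Accuracy is the share of predictions that match the true labels.
accuracy = accuracy_score(y_test, y_pred)

# Misclassified samples are those where prediction and truth differ.
misclassified = int((y_test != y_pred).sum())
print(accuracy, misclassified)
```

The two are linked: accuracy equals 1 minus the misclassified count divided by the number of test samples.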
18. INPUT:
OUTPUT:

INFERENCE: Answer: C
From the plot we can see that the number of employees who worked about 150 hours per
month falls in the range above 2500.

19. INPUT:
OUTPUT:

INFERENCE: Answer: A
The accuracy score of the predicted model is 95%.

20. INPUT:
OUTPUT:

INFERENCE: Answer: C
From the plot we can see that the performance level of the people who have worked on two
projects is low, not high.
