
Thursday, July 7, 2022

BITS-WILP-Machine Learning - ML - Comprehensive Examination - 2019-2020

Birla Institute of Technology & Science, Pilani

Work Integrated Learning Programmes Division

Second Semester 2019-20

M.Tech. (Data Science and Engineering)

Comprehensive Examination (Makeup)

Course No. : DSECLZG565

Course Title : MACHINE LEARNING

Nature of Exam : Open Book

Weightage : 40%

Duration : 2 Hours

Date of Exam: July 12, 2020 Time of Exam: 10:00 AM – 12:00 PM

Note: Assumptions made if any, should be stated clearly at the beginning of your answer.

Question 1. [3+3+2+3=11 Marks]

Suppose you receive messages as sequences of bits (0’s and 1’s) with unknown bias θ for 1’s, and a message sequence x1, x2, ..., xn of length n is received.

What θ maximizes the likelihood of the observed data (in terms of n)? Assume that the sample x1, x2, ..., xn is from a parametric distribution f(x|θ), where f(x|θ) is the Bernoulli probability mass function with parameter θ. [3 marks]
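For a Bernoulli sample, setting the derivative of the log-likelihood to zero gives the fraction of 1's in the sequence as the maximiser. The sketch below checks that closed form against a grid search; the bit sequence is a made-up example, not part of the question.

```python
import math


def bernoulli_log_likelihood(theta, bits):
    """Log-likelihood of i.i.d. Bernoulli(theta) observations."""
    k = sum(bits)   # number of 1's
    n = len(bits)   # sequence length
    return k * math.log(theta) + (n - k) * math.log(1 - theta)


def bernoulli_mle(bits):
    """Closed-form maximiser: the fraction of 1's in the sequence."""
    return sum(bits) / len(bits)


# Hypothetical message, used only for illustration.
bits = [1, 0, 1, 1, 0, 1, 0, 1]
theta_hat = bernoulli_mle(bits)

# Sanity check: search a fine grid of theta values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
best_on_grid = max(grid, key=lambda t: bernoulli_log_likelihood(t, bits))

print(theta_hat, best_on_grid)
```

The grid search is only there to confirm the closed form; in an answer the derivation itself is what is asked for.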

In context of naive Bayes, what is meant by Laplace smoothing? [1 mark]

Handling extremely low probabilities.
None of these
Make zero probabilities non-zero.
Making probabilities zero.

Why is the Naïve Bayes algorithm so called? [2 marks]

Consider fitting a logistic regression model to predict whether a customer will default on a bank loan, given the customer's bank balance, income and whether the customer is a student or non-student. The optimal model coefficients are: Intercept = -10.86, balance = 0.0057, income = 0.0030 and student = -0.6468. Predict whether a student with a balance of Rs. 1500 and an income of Rs. 40,000 will default or not. [2 marks]
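A minimal sketch of the prediction step, using the coefficients quoted in the question. The question does not say in which unit income enters the model, so both readings are computed: income in thousands (an assumption) and income in raw rupees. The function and variable names are illustrative.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Coefficients as quoted in the question.
INTERCEPT = -10.86
COEF_BALANCE = 0.0057
COEF_INCOME = 0.0030
COEF_STUDENT = -0.6468


def default_probability(balance, income, is_student):
    """P(default) under the fitted logistic regression model."""
    z = (INTERCEPT
         + COEF_BALANCE * balance
         + COEF_INCOME * income
         + COEF_STUDENT * (1 if is_student else 0))
    return sigmoid(z)


# Assumption 1: income is expressed in thousands (40,000 -> 40).
p_thousands = default_probability(1500, 40, True)

# Assumption 2: income is used in raw rupees.
p_rupees = default_probability(1500, 40000, True)

print(round(p_thousands, 4), round(p_rupees, 4))
```

With a 0.5 threshold the two readings give opposite decisions, so the unit assumption should be stated at the start of the answer.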

The regression line for predicting height from weight is height = 1.51*weight + 45.47. Heights are in cm and weights in kg. Interpret the equation and find the height of a person whose weight is 100 kg. [2+1=3 marks]

Question 2. [2+5=7 Marks]

An odd parity generator outputs a ‘1’ when the sum of ‘1’s in an input binary sequence is odd.

What are the parity bits P for a binary sequence (x1, x2) of length 2? x1, x2 are either 0 or 1. [2 marks]
Realize an odd parity generator for binary sequence of length 2 using an MLP, with the following logic gate building blocks (with sigmoidal activation function). Show the network architecture with all weights and bias values. [ 1+1+3 = 5 marks]
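The gate building blocks referred to in the question are not reproduced in this post, so the sketch below assumes one common choice: an OR unit and a NAND unit in the hidden layer feeding an AND unit at the output. The weights (±20) and biases are illustrative values chosen to saturate the sigmoid, not values from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Single sigmoidal unit: sigmoid(w . x + b)."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def odd_parity(x1, x2):
    """Two-layer MLP: AND(OR(x1, x2), NAND(x1, x2))."""
    h_or = neuron((x1, x2), (20, 20), -10)      # close to 1 unless both inputs are 0
    h_nand = neuron((x1, x2), (-20, -20), 30)   # close to 1 unless both inputs are 1
    return neuron((h_or, h_nand), (20, 20), -30)  # close to 1 only if both hidden units fire


# Round the sigmoid outputs to get the parity bit for every input pair.
table = {(x1, x2): round(odd_parity(x1, x2))
         for x1 in (0, 1) for x2 in (0, 1)}
print(table)
```

Any weights with the same signs and large enough magnitude give the same truth table; the magnitudes only control how sharply the sigmoid saturates.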

Question 3. Answer the following questions. [5+5 =10 Marks]

Consider training an AdaBoost classifier using decision stumps on the following data set. Decision stump classifier chooses a constant value c and classifies all points where x > c as one class and other points where x ≤ c as the other class.

1. What is the initial weight that is assigned to each data point? [1 mark]

2. Show the decision boundary for the first decision stump (indicate the positive and negative side of the decision boundary). [2 marks]

3. Circle the point whose weight increases in the boosting process [2 marks]

Suppose you are given the following pairs. You will simulate the k-means algorithm to identify TWO clusters in the data. Suppose you are given initial assignment cluster centre as {cluster1: #1}, {cluster2: #10} – the first data point is used as the first cluster centre and the 10th as the second cluster centre. Please simulate the k-means (k=2) algorithm for one iteration. What are the cluster assignments after one iteration? Assume k-means uses Euclidean distance.

[5 Marks]

Data #	x	y
1	1.9	0.97
2	1.76	0.84
3	2.32	1.63
4	2.31	2.09
5	1.14	2.11
6	5.02	3.02
7	5.74	3.84
8	2.25	3.47
9	4.71	3.6
10	3.17	4.96
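One k-means iteration on the table above can be sketched as an assignment step followed by a centre update. The dictionary layout and names are illustrative; the data and initial centres are the ones given in the question.

```python
import math

# Data points from the table, keyed by data number.
points = {
    1: (1.9, 0.97), 2: (1.76, 0.84), 3: (2.32, 1.63), 4: (2.31, 2.09),
    5: (1.14, 2.11), 6: (5.02, 3.02), 7: (5.74, 3.84), 8: (2.25, 3.47),
    9: (4.71, 3.6), 10: (3.17, 4.96),
}

# Initial centres: point #1 and point #10.
centres = {"cluster1": points[1], "cluster2": points[10]}

# Assignment step: each point goes to the nearest centre (Euclidean distance).
assignments = {
    idx: min(centres, key=lambda c: math.dist(p, centres[c]))
    for idx, p in points.items()
}

# Update step: each centre moves to the mean of its assigned points.
new_centres = {}
for name in centres:
    members = [points[i] for i, c in assignments.items() if c == name]
    new_centres[name] = (
        sum(x for x, _ in members) / len(members),
        sum(y for _, y in members) / len(members),
    )

print(assignments)
print(new_centres)
```

`math.dist` requires Python 3.8 or later.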

Question 4. Answer the following questions. [5 Marks]

Students in a particular class are graded in subjects A, B and C out of 10 points. Based on the information provided in the table below for 8 students, predict, using the KNN algorithm with K = 3, whether a student who scored A = 5, B = 7, C = 6 will pass or fail.

Score in A	Score in B	Score in C	Result
9	5	7	Pass
7	3	6	Fail
5	8	9	Pass
8	6	7	Pass
4	7	8	Fail
6	7	6	Pass
6	8	5	Fail
5	6	5	Fail
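A short sketch of the KNN vote for the table above, assuming Euclidean distance and a simple majority vote (the question does not name the distance measure). Function and variable names are illustrative.

```python
import math
from collections import Counter

# (score in A, score in B, score in C) -> result, from the table.
train = [
    ((9, 5, 7), "Pass"),
    ((7, 3, 6), "Fail"),
    ((5, 8, 9), "Pass"),
    ((8, 6, 7), "Pass"),
    ((4, 7, 8), "Fail"),
    ((6, 7, 6), "Pass"),
    ((6, 8, 5), "Fail"),
    ((5, 6, 5), "Fail"),
]


def knn_predict(query, k=3):
    """Return the majority label among the k nearest training rows."""
    neighbours = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0], neighbours


prediction, neighbours = knn_predict((5, 7, 6), k=3)
print(prediction, neighbours)
```

The three nearest rows are at squared distances 1, 2 and 3 from the query, so there are no ties to break for K = 3.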

Question 5. Answer the following questions. [7 Marks]

Find the equation of the separating hyperplane for the points below using the linear Support Vector Machine method.

Positive Points: {(3, 2), (4, 3), (2, 3), (3, -1)}

Negative Points: {(1, 0), (-1, -3), (0, 2), (-1, 2)}
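The sketch below does not solve the SVM optimisation; it only checks a candidate hyperplane against the hard-margin conditions. The candidate w = (8/7, 2/7), b = -15/7 (that is, 4·x1 + x2 = 7.5) comes from a hand calculation that assumes the support vectors are (3, -1), (2, 3) and (1, 0). That assumption, and the multipliers listed, are not part of the question.

```python
from fractions import Fraction as F

positive = [(3, 2), (4, 3), (2, 3), (3, -1)]
negative = [(1, 0), (-1, -3), (0, 2), (-1, 2)]

# Candidate hyperplane w . x + b = 0 (assumed, derived by hand).
w = (F(8, 7), F(2, 7))
b = F(-15, 7)


def margin(point, label):
    """Functional margin y * (w . x + b); must be >= 1 for every point."""
    return label * (w[0] * point[0] + w[1] * point[1] + b)


margins = {p: margin(p, +1) for p in positive}
margins.update({p: margin(p, -1) for p in negative})

# Candidate dual multipliers for the assumed support vectors.
alphas = {(3, -1): F(22, 49), (2, 3): F(12, 49), (1, 0): F(34, 49)}
labels = {(3, -1): +1, (2, 3): +1, (1, 0): -1}

# Stationarity: w must equal sum(alpha_i * y_i * x_i).
w_from_dual = tuple(
    sum(alphas[p] * labels[p] * p[i] for p in alphas) for i in range(2)
)
# Dual constraint: sum(alpha_i * y_i) must be zero.
balance = sum(alphas[p] * labels[p] for p in alphas)

print(margins)
print(w_from_dual, balance)
```

All margins being at least 1, with margin exactly 1 on the assumed support vectors and positive multipliers that reproduce w, is what the KKT conditions require of a hard-margin solution.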

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

BITS-WILP-Machine Learning - ML - Mid Semester - 2019-2020

Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division

Second Semester 2019-2020

M.Tech (Data Science and Engineering)

Mid-Semester Test (EC-2 Regular)

Course No. : DSECL ZG565

Course Title : MACHINE LEARNING

Nature of Exam : Closed Book

Weightage : 30%

Duration : 90 minutes

Date of Exam : December 29, 2019 (FN)

Note:

Please follow all the Instructions to Candidates given on the cover page of the answer book.
All parts of a question should be answered consecutively. Each answer should start from a fresh page.
Assumptions made if any, should be stated clearly at the beginning of your answer.

Answer All the Questions (only on the pages mentioned against questions. If you need more pages, continue remaining answers from page 20 onwards)

Question 1. [Marks 2+3=5] [to be answered only on pages 3-5]

a) What are the steps in designing a machine learning system? (2 marks)

b) A survey of 200 families was conducted to observe the relationship between average annual income and whether the family will buy a car or not. Consider the following table:

	Income below Rs 10 lakhs	Income >= Rs 10 lakhs	Total
Buyer	38	42	80
Non-Buyer	82	38	120
Total	120	80	200

What is the probability that a randomly selected family is a buyer? (1 mark)
What is the probability that a randomly selected family is both a buyer of the car and has an income of Rs 10 lakhs and above? (1 mark)
A family selected at random belongs to the category of income greater than Rs 10 lakhs. What is the probability that they will buy a car? (1 mark)
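The three probabilities can be read straight off the contingency table. This sketch uses exact fractions and treats "income greater than Rs 10 lakhs" as the table's ">= Rs 10 lakhs" column, which is an assumption about the wording.

```python
from fractions import Fraction

# Counts from the survey table: (buyer status, income band) -> families.
counts = {
    ("buyer", "below_10"): 38,
    ("buyer", "10_and_above"): 42,
    ("non_buyer", "below_10"): 82,
    ("non_buyer", "10_and_above"): 38,
}
total = sum(counts.values())

# Marginal probability of being a buyer.
p_buyer = Fraction(
    counts[("buyer", "below_10")] + counts[("buyer", "10_and_above")], total
)

# Joint probability of buyer AND income of Rs 10 lakhs and above.
p_buyer_and_high = Fraction(counts[("buyer", "10_and_above")], total)

# Conditional probability of buyer GIVEN income of Rs 10 lakhs and above.
p_high = Fraction(
    counts[("buyer", "10_and_above")] + counts[("non_buyer", "10_and_above")],
    total,
)
p_buyer_given_high = p_buyer_and_high / p_high

print(p_buyer, p_buyer_and_high, p_buyer_given_high)
```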

Question 2. [Marks =5] [to be answered only on pages 6-7]

Consider there are two bags A and B, where A contains 5 white balls and 7 blue balls whereas B contains 2 white and 12 blue balls. We pick bag A, 50% of the time. After an experiment, a white ball is selected. What is the probability that the ball is drawn from bag B? (5 marks)
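This is a direct application of Bayes' theorem. A sketch with exact fractions, assuming bag B is picked the remaining 50% of the time:

```python
from fractions import Fraction

# Prior probability of picking each bag (B assumed to take the other 50%).
prior = {"A": Fraction(1, 2), "B": Fraction(1, 2)}

# Likelihood of drawing a white ball from each bag.
p_white_given = {
    "A": Fraction(5, 5 + 7),    # 5 white, 7 blue
    "B": Fraction(2, 2 + 12),   # 2 white, 12 blue
}

# Total probability of drawing a white ball.
p_white = sum(prior[bag] * p_white_given[bag] for bag in prior)

# Posterior probability that the white ball came from bag B.
p_b_given_white = prior["B"] * p_white_given["B"] / p_white

print(p_b_given_white, float(p_b_given_white))
```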

Question 3. [Marks=5] [to be answered only on pages 8-9]

Given the following labelled training data,

Flat 20% Cashback on Oyo Room bookings done via Paytm. (SPAM)

Lets Talk Fashion! Get flat 40% Cashback on Backpacks (SPAM)

Opportunity with Product firm for Fullstack (HAM)

Javascript Developer, Full Stack Developer in Bangalore (HAM)

Use a Naive Bayes classifier with Laplace smoothing to classify the sentence “Scan Paytm QR Code to Pay & Win 100% Cashback”.
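A minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing for the four training sentences. The tokenisation rule (lowercase; keep letters, digits and '%'; drop everything else) is an assumption, and a different rule changes the word counts and vocabulary size.

```python
import math
import re
from collections import Counter

training = [
    ("Flat 20% Cashback on Oyo Room bookings done via Paytm.", "SPAM"),
    ("Lets Talk Fashion! Get flat 40% Cashback on Backpacks", "SPAM"),
    ("Opportunity with Product firm for Fullstack", "HAM"),
    ("Javascript Developer, Full Stack Developer in Bangalore", "HAM"),
]


def tokenize(text):
    # Assumed rule: lowercase, keep runs of letters, digits and '%'.
    return re.findall(r"[a-z0-9%]+", text.lower())


word_counts = {"SPAM": Counter(), "HAM": Counter()}
doc_counts = Counter()
for text, label in training:
    word_counts[label].update(tokenize(text))
    doc_counts[label] += 1

vocabulary = set(word_counts["SPAM"]) | set(word_counts["HAM"])


def log_score(text, label):
    """log P(label) + sum of log P(word | label) with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in tokenize(text):
        smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
        score += math.log(smoothed)
    return score


query = "Scan Paytm QR Code to Pay & Win 100% Cashback"
scores = {label: log_score(query, label) for label in word_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Working in log space avoids underflow from multiplying many small probabilities; the ranking of the classes is unchanged.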

Question 4. [Marks 2+3=5] [to be answered only on pages 10-11]

Explain the cost/error function used in logistic regression. (2 marks)
Compare probabilistic generative models and probabilistic discriminative models with examples. (3 marks)

Question 5. [Marks 3+2=5] [to be answered only on pages 12-14]

Plot the cost function J(w) for linear regression y = w1x for the training data pairs <0, 0>, <0.5, 0.5>, <1, 1>, <1.5, 1.6>. (3 marks)
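The points needed for the plot can be tabulated as below. The sketch assumes the usual half mean squared error, J(w1) = (1/2m) Σ (w1·x − y)², which the question does not spell out; a different scaling changes the heights but not the location of the minimum.

```python
# Training pairs <x, y> from the question.
data = [(0, 0), (0.5, 0.5), (1, 1), (1.5, 1.6)]


def cost(w1):
    """Assumed cost: half the mean squared error of the model y = w1 * x."""
    m = len(data)
    return sum((w1 * x - y) ** 2 for x, y in data) / (2 * m)


# Closed-form minimiser for a line through the origin: sum(xy) / sum(x^2).
w_star = sum(x * y for x, y in data) / sum(x * x for x, _ in data)

# Points for the plot: w1 = 0, 0.25, ..., 2.
curve = [(k / 4, cost(k / 4)) for k in range(9)]

for w1, j in curve:
    print(f"w1={w1:.2f}  J={j:.5f}")
print("minimum near w1 =", round(w_star, 4))
```

The curve is a parabola in w1, with its minimum slightly above 1 because the last point lies above the line y = x.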

Distinguish Bias and variance in the machine learning domain and discuss how model complexity is affected by these two. (2 marks)

Question 6. [Marks 2+1+2= 5] [to be answered only on pages 15-16]

Provide answers based on the following set of training examples

Instance	a1	a2	Classification
1	T	T	+
2	T	T	+
3	T	F	-
4	F	F	+
5	F	T	-
6	F	T	-

What is the entropy of this collection of training examples with respect to the target function classification? (2 marks)
What is the information gain of a2 relative to these training examples? (1 mark)
Why do we prefer shorter/smaller trees while learning a decision tree? Does ID3 guarantee a shorter tree? (2 marks)
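The entropy and information gain for the table above can be computed as in this sketch; the tuple layout and function names are illustrative.

```python
import math
from collections import Counter

# (a1, a2, classification) for instances 1..6.
examples = [
    ("T", "T", "+"),
    ("T", "T", "+"),
    ("T", "F", "-"),
    ("F", "F", "+"),
    ("F", "T", "-"),
    ("F", "T", "-"),
]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(attr_index):
    """Entropy of the full set minus the weighted entropy after the split."""
    labels = [row[-1] for row in examples]
    remainder = 0.0
    for value in {row[attr_index] for row in examples}:
        subset = [row[-1] for row in examples if row[attr_index] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder


base_entropy = entropy([row[-1] for row in examples])
gain_a1 = information_gain(0)
gain_a2 = information_gain(1)
print(base_entropy, gain_a1, gain_a2)
```

With three positive and three negative examples the collection is maximally mixed, and splitting on a2 leaves both branches evenly mixed as well.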

----------------------------------------------------------------------------


Saturday, July 2, 2022

BITS - ISM - Important Links -Mid semester

Some links which might be useful for online calculation (courtesy of a WhatsApp member)

Confidence Interval Calculator:

https://www.calculator.net/confidence-interval-calculator.html

Confidence interval calculator with finite population

https://select-statistics.co.uk/calculators/confidence-interval-calculator-population-mean/

Population Correction Calculator

https://select-statistics.co.uk/calculators/confidence-interval-calculator-population-mean/

Z Score calculator:

https://www.calculator.net/z-score-calculator.html

----------------------------------------------------------------------------


Thursday, June 30, 2022

BITS-WILP-IDS-Makeup 2021 - Mid semester with Solutions

Birla Institute of Technology and Science, Pilani

Work Integrated Learning Programmes Division

BITS WILP Programme - M. Tech. in Data Science and Engg

I Semester 2020-2021

Mid-Semester Test

(EC2 - Makeup)

Course Number DSECL ZG523

Course Name Introduction to Data Science

Nature of Exam Open Book

Weight-age for grading 30

Duration 1.5 hrs

Date of Exam

# Pages 2

# Questions 5

Instructions

1. All questions are compulsory.

2. Questions are to be answered in the order in which they appear in this paper.

3. All answers must be directed to the question in short and simple paragraphs or bullet points; use visuals/diagrams wherever necessary.

4. Assumptions made if any, should be stated clearly at the beginning of your answer.

1. Challenge the accuracy of the below statements (agree/disagree) with short justification. [Answer should contain few bullet points of no more than 2 – 4 lines]. [5]

(a) For Data Collection and Pre-processing, domain Knowledge is essential, whereas for Model Building and Evaluation it is optional. [2.5]

• Agree [0.5]

• Data Pre-processing (cleansing, outlier detection, imputation, feature engineering, etc.) expects domain expertise (in the area of the Data/Business) so that Data Scientist (or Domain expert) can identify relevant/irrelevant features, right techniques for Data Imputation and take decisions regarding the validity of the source data by performing EDA, etc. [1]

• Model building is purely technical activity where the focus is exclusively on the choice of the model, tuning the parameters and evaluating the performance, etc. using various tools. [1]

(b) In supervised learning, we do not need a separate Validation-set when there already exists a Test-set. [2.5]

• Disagree [0.5]

• Validation-set is used for internal assessment of Model performance during development and for hyper-parameter tuning. [1]

• Test-set is exclusively reserved for Testing the Model Performance on ‘un-seen’ data (kept locked away, not accessible to Model builder, similar to ‘Acceptance Test’ in software engineering). [1]

2. You have started your spring cleaning activity. On day 10 you have decided that you will clean your email inbox. During this cleaning, you noticed that you are paying around Rs.3000/- per month against your internet bill. You want to reduce the bill amount. Assume relevant data and state the assumptions clearly at the beginning of your answer. [6]

(a) Identify the type of data analytics that you will use if you observed that the internet usage has increased because of the pandemic. [1]

(b) Identify the type of data analytics that you will use if you compared the various bills and then switched to a different plan that better suits your need. [1]

Solution:

(a) Descriptive analytics / EDA / exploratory [1]

(b) Prescriptive analytics [1]

i. Do you have any related requirements? If yes, what are those?

ii. What are the existing/related systems within the capability that capture/use related information?

iii. What are the gaps?

iv. Who are the stakeholders?

v. Who will be affected by this implementation?

vi. Any Licenses/Commercials needed in case of proprietary solutions?

vii. Are there any other dependencies?

viii. Do you see any other problems/challenges?

ix. Does the client have a technology preference?

x. Does the client have limited / unlimited infrastructure?

xi. How will we measure the business impact of the final model? How will we justify the project expense against the benefits?

xii. Is there a plan for domain expert validation?

xiii. If so, will the model be in a form that they can understand?

xiv. Is there an evaluation metric set up by the business?

xv. Is there a hold-out data (i.e. data used for validation) available?

xvi. Against what baseline or benchmark the results are compared?

xvii. Are the business costs and benefits taken into account?

3. Consider the below data set for carrying out some data science activity. There are five columns identified by f1, f2, f3, f4, f5 ("f" means feature). In the table "NA" means "Not Available" (or missing value). Your task as a data engineer is to obtain the missing values using some imputation technique. The domain expert gave advice that f3 can be imputed using mean, f4 can be imputed using median and f5 can be imputed using mean based on the feature f2. Write the steps to be followed to obtain the missing values and re-write the dataset with the imputed values. [6]

row f1 f2 f3 f4 f5

1 25 A 5 3 37

2 27 B NA 5 38

3 28 A 17 NA 40

4 24 NA 6 4 43

5 29 C NA 6 36

6 22 C 17 5 NA

7 23 A 5 5 NA

Solution:

(a) f2: Categorical data. Hence mode = A is used to impute row 4. [1]

(b) f3: mean of 5, 17, 6, 17, 5 is 50/5 = 10. Hence, rows 2 and 5 are imputed using 10. [1]

(c) f4: median of 3, 5, 4, 6, 5, 5 is 5. Hence, row 3 is imputed using 5. [1]

(d) f5: row 6 = 36, as there is only one value for C. Row 7 = mean of 37, 40, 43 = 40 (the f5 values for A, with row 4 imputed as A). [1*2=2]

(e) [1]

row f1 f2 f3 f4 f5

1 25 A 5 3 37

2 27 B 10 5 38

3 28 A 17 5 40

4 24 A 6 4 43

5 29 C 10 6 36

6 22 C 17 5 36

7 23 A 5 5 40
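The imputation steps can be sketched in plain Python as below, working directly from the question's table as printed. The f3 mean is taken over all five observed values. The row layout and helper name are illustrative.

```python
from statistics import mean, median, mode

NA = None  # marker for a missing value
columns = ("row", "f1", "f2", "f3", "f4", "f5")
raw = [
    (1, 25, "A", 5, 3, 37),
    (2, 27, "B", NA, 5, 38),
    (3, 28, "A", 17, NA, 40),
    (4, 24, NA, 6, 4, 43),
    (5, 29, "C", NA, 6, 36),
    (6, 22, "C", 17, 5, NA),
    (7, 23, "A", 5, 5, NA),
]
rows = [dict(zip(columns, r)) for r in raw]


def observed(name):
    """All non-missing values of one column."""
    return [r[name] for r in rows if r[name] is not NA]


# Steps 1-3: mode for categorical f2, mean for f3, median for f4.
fill = {"f2": mode(observed("f2")),
        "f3": mean(observed("f3")),
        "f4": median(observed("f4"))}
for r in rows:
    for name, value in fill.items():
        if r[name] is NA:
            r[name] = value

# Step 4: f5 is imputed with the mean of f5 within the same f2 group
# (computed after f2 has been filled in).
group_mean = {
    g: mean(r["f5"] for r in rows if r["f2"] == g and r["f5"] is not NA)
    for g in {r["f2"] for r in rows}
}
for r in rows:
    if r["f5"] is NA:
        r["f5"] = group_mean[r["f2"]]

for r in rows:
    print(r)
```

The order matters: f2 has to be imputed before f5, because the group means for f5 depend on which group row 4 falls into.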

4. A motor insurance company has an investigation team that investigates the claims made. The fraud investigation team investigates up to 30% of all claims made. The company realised that it was losing too much money due to fraudulent claims. List four ways in which predictive data analytics could be used to help address this business problem. For each proposed approach, describe the predictive model that will be built, how the model will be used by the business, and how using the model will help address the original business problem. [8]

Solution:

(a) Claim prediction – Predict the likelihood that an insurance claim is fraudulent. Assign every newly arising claim a fraud likelihood. Those that are most likely to be fraudulent could be flagged for investigation by the investigation team. This would increase the number of fraudulent claims detected and reduce the amount of money lost to fraud.

(b) Member prediction – Predict the propensity of a member to commit fraud. Run this model every quarter to identify those members most likely to commit fraud. The company could take a risk mitigation action ranging from contacting the member with some kind of warning to canceling the member’s policies. By identifying members likely to make fraudulent claims before they make them, the company could save significant money.

(c) Application prediction – Predict the likelihood that a new policy application will ultimately result in a fraudulent claim. For every new application this model could be run, and applications that are predicted likely to result in a fraudulent claim could be rejected.

(d) Payment prediction – Predict the amount most likely to be paid out by the company after investigating the claim. Run this model whenever new claims are made. The policy holder could be offered the amount predicted by the model as settlement as an alternative to going through a claims investigation process. The company could save on claims investigations and reduce the amount of money paid out on fraudulent claims.

5. While tuning parameters for a classification algorithm, three different confusion matrices were produced. For parameter p1, the confusion matrix is c1. For parameter p2, the confusion matrix is c2. For parameter p3, the confusion matrix is c3. Explain how to draw the ROC curve and write the necessary equations. Using the ROC curve, determine the best parameter among p1, p2, and p3. Show each of the steps clearly. [5]

Confusion Matrix c1

	Predicted: No	Predicted: Yes
Actual: No	500	100
Actual: Yes	50	1000

Confusion Matrix c2

	Predicted: No	Predicted: Yes
Actual: No	540	60
Actual: Yes	250	800

Confusion Matrix c3

	Predicted: No	Predicted: Yes
Actual: No	570	30
Actual: Yes	450	600

Solution:

For C1 : [0.5 + 0.5 marks]

TPR = 1000/1050 = 0.95

FPR = 100/600 = 0.17

For C2 : [0.5 + 0.5 marks]

TPR = 800/1050 = 0.76

FPR = 60/600 = 0.10

For C3 : [0.5 + 0.5 marks]

TPR = 600/1050 = 0.57

FPR = 30/600 = 0.05

The red line is the reference line (the diagonal TPR = FPR).

The point (0.17, 0.95) for p1 lies farthest above the red line.

Hence, p1 is the best parameter.

1 mark for the graph. 1 mark for the explanation.
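The ROC points can be recomputed from the three confusion matrices using TPR = TP / (TP + FN) and FPR = FP / (FP + TN). To rank the parameters without the graph, this sketch uses the vertical distance above the diagonal, TPR − FPR. That is one common criterion, chosen here as an assumption rather than taken from the marking scheme.

```python
# Confusion matrices for the three parameter settings.
matrices = {
    "p1": {"tn": 500, "fp": 100, "fn": 50, "tp": 1000},
    "p2": {"tn": 540, "fp": 60, "fn": 250, "tp": 800},
    "p3": {"tn": 570, "fp": 30, "fn": 450, "tp": 600},
}


def roc_point(m):
    """Return (FPR, TPR) for one confusion matrix."""
    fpr = m["fp"] / (m["fp"] + m["tn"])
    tpr = m["tp"] / (m["tp"] + m["fn"])
    return fpr, tpr


points = {name: roc_point(m) for name, m in matrices.items()}

# Assumed ranking criterion: height above the diagonal TPR = FPR.
lift = {name: tpr - fpr for name, (fpr, tpr) in points.items()}
best = max(lift, key=lift.get)

for name, (fpr, tpr) in points.items():
    print(f"{name}: FPR={fpr:.2f} TPR={tpr:.2f} TPR-FPR={lift[name]:.2f}")
print("best:", best)
```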

----------------------------------------------------------------------------
