Thursday, July 7, 2022

BITS-WILP-Machine Learning - ML - Comprehensive Examination - 2019-2020

Birla Institute of Technology & Science, Pilani

Work Integrated Learning Programmes Division

Second Semester 2019-20

M.Tech. (Data Science and Engineering)

Comprehensive Examination (Makeup)



Course No.         : DSECLZG565

Course Title         : MACHINE LEARNING  

Nature of Exam        : Open Book 

Weightage         : 40% 

Duration         : 2 Hours  

Date of Exam:           July 12, 2020                            Time of Exam: 10:00 AM – 12:00 PM


Note: Assumptions made if any, should be stated clearly at the beginning of your answer. 


Question 1.    [3+3+2+3=11 Marks]   

             

  1. Suppose you receive a message as a sequence of bits (0’s and 1’s) with unknown bias θ for 1’s; a message sequence x1, x2, ..., xn of length n is received.

What value of θ maximizes the likelihood of the observed data (in terms of n)? Assume that the sample x1, x2, ..., xn is drawn from a parametric distribution f(x|θ), where f(x|θ) is the Bernoulli probability mass function with parameter θ. [3 marks]
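A minimal numerical sketch of the idea behind this question: for Bernoulli data the log-likelihood is (number of 1's)·log θ + (number of 0's)·log(1 − θ), and its maximizer is the fraction of 1's in the sequence. The bit sequence below is hypothetical and only serves to compare the closed-form value with a brute-force grid search.

```python
import math

def bernoulli_log_likelihood(theta, bits):
    """Log-likelihood of an i.i.d. Bernoulli(theta) sample."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return ones * math.log(theta) + zeros * math.log(1.0 - theta)

# Hypothetical message; not part of the exam question.
bits = [1, 0, 1, 1, 0, 1, 0, 1]

# Closed-form maximizer: (x1 + ... + xn) / n.
closed_form = sum(bits) / len(bits)

# Brute-force check over a grid of candidate theta values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
grid_best = max(grid, key=lambda t: bernoulli_log_likelihood(t, bits))

print(closed_form, grid_best)
```

Both values should agree up to the grid resolution, which illustrates that the maximum-likelihood estimate is the sample mean of the bits.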



    1. In the context of naive Bayes, what is meant by Laplace smoothing? [1 mark]

  1. Handling extremely low probabilities. 

  2. None of these 

  3. Make zero probabilities non-zero. 

  4. Making probabilities zero.

  1. Why is the Naïve Bayes algorithm so called?    [2 marks]


  1. Consider fitting a logistic regression model to predict whether a customer will default on a bank loan, given their bank balance, income and student status (student/non-student). The optimal model coefficients are: intercept = -10.86, balance = 0.0057, income = 0.0030 and student = -0.6468. Predict whether a student with a balance of Rs. 1500 and an income of Rs. 40,000 will default or not. [2 marks]
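The prediction can be sketched as follows. The question does not state the unit of the income feature, so this sketch evaluates both readings (income in rupees and income in thousands of rupees); which one applies is an assumption that should be stated in the answer. The 0.5 decision threshold is also an assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients as given in the question.
INTERCEPT = -10.86
B_BALANCE = 0.0057
B_INCOME = 0.0030
B_STUDENT = -0.6468

def default_probability(balance, income, is_student):
    """P(default) under the fitted logistic model."""
    z = (INTERCEPT
         + B_BALANCE * balance
         + B_INCOME * income
         + B_STUDENT * (1 if is_student else 0))
    return sigmoid(z)

# Reading 1 (assumption): income is entered in rupees.
p_rupees = default_probability(1500, 40000, True)

# Reading 2 (assumption): income is entered in thousands of rupees.
p_thousands = default_probability(1500, 40, True)

for name, p in (("income in Rs", p_rupees), ("income in Rs '000", p_thousands)):
    print(name, round(p, 4), "default" if p >= 0.5 else "no default")
```

The two readings lead to opposite predictions, which is why the unit assumption matters here.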


  1. The regression line for predicting height from weight is height = 1.51*weight + 45.47. Height is in cm and weight in kg. Interpret the equation and find the height of a person whose weight is 100 kg. [2+1=3 marks]
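As a quick check of the arithmetic, the line can be evaluated directly. The function name below is illustrative only; the slope says that each additional kilogram is associated with a predicted height increase of 1.51 cm, and 45.47 cm is the intercept.

```python
def predicted_height_cm(weight_kg):
    """Evaluate the regression line height = 1.51 * weight + 45.47."""
    return 1.51 * weight_kg + 45.47

height = predicted_height_cm(100)
print(round(height, 2))
```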


Question 2.  [2+5=7 Marks]   

An odd parity generator outputs a ‘1’ when sum of ‘1’s in an input binary sequence is odd. 

  1. What are the parity bits P for a binary sequence (x1, x2) of length 2? x1, x2 are either 0 or 1. [2 marks]

  2. Realize an odd parity generator for binary sequence of length 2 using an MLP, with the following logic gate building blocks (with sigmoidal activation function). Show the network architecture with all weights and bias values. [ 1+1+3 = 5 marks]
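For a two-bit input, odd parity is the XOR function. The sketch below realizes it with three sigmoid units: an OR unit and a NAND unit in the hidden layer, combined by an AND unit at the output. The gate decomposition and the specific weights and biases are assumptions chosen for illustration (the building blocks referred to in the question are not reproduced here); any weights that saturate the sigmoid in the same way would work.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Single sigmoid unit: sigmoid(w . x + b)."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def odd_parity(x1, x2):
    # Hidden layer (assumed weights): OR and NAND of the two inputs.
    h_or = neuron((x1, x2), (20.0, 20.0), -10.0)
    h_nand = neuron((x1, x2), (-20.0, -20.0), 30.0)
    # Output layer (assumed weights): AND of the two hidden units.
    return neuron((h_or, h_nand), (20.0, 20.0), -30.0)

# Round the sigmoid output to get the parity bit P for each input pair.
truth_table = {(a, b): round(odd_parity(a, b)) for a in (0, 1) for b in (0, 1)}
for inputs, parity in truth_table.items():
    print(inputs, parity)
```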

Question 3.    Answer the following questions. [5+5 =10 Marks]


  1. Consider training an AdaBoost classifier using decision stumps on the following data set. Decision stump classifier chooses a constant value c and classifies all points where x > c as one class and other points where x ≤ c as the other class. 

1. What is the initial weight that is assigned to each data point? [1 marks]

2. Show the decision boundary for the first decision stump (indicate the positive and negative side of the decision boundary).  [2 marks]

3. Circle the point whose weight increases in the boosting process [2 marks]


  1. Suppose you are given the following pairs. You will simulate the k-means algorithm to identify TWO clusters in the data. Suppose you are given initial assignment cluster centre as {cluster1: #1}, {cluster2: #10} – the first data point is used as the first cluster centre and the 10th as the second cluster centre. Please simulate the k-means (k=2) algorithm for one iteration. What are the cluster assignments after one iteration? Assume k-means uses Euclidean distance.

                                     [5 Marks]   



Data # x y
1 1.9 0.97
2 1.76 0.84
3 2.32 1.63
4 2.31 2.09
5 1.14 2.11
6 5.02 3.02
7 5.74 3.84
8 2.25 3.47
9 4.71 3.6
10 3.17 4.96
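One iteration of k-means on this data can be sketched as follows: assign each point to the nearer of the two initial centres (points #1 and #10) using Euclidean distance, then recompute each centre as the mean of its assigned points. Variable names are illustrative.

```python
import math

# Data points transcribed from the table: index -> (x, y).
points = {
    1: (1.9, 0.97), 2: (1.76, 0.84), 3: (2.32, 1.63), 4: (2.31, 2.09),
    5: (1.14, 2.11), 6: (5.02, 3.02), 7: (5.74, 3.84), 8: (2.25, 3.47),
    9: (4.71, 3.6), 10: (3.17, 4.96),
}

# Initial centres as specified: point #1 and point #10.
centres = {"cluster1": points[1], "cluster2": points[10]}

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Assignment step: each point goes to the nearest centre.
assignments = {
    idx: min(centres, key=lambda c: euclidean(p, centres[c]))
    for idx, p in points.items()
}

# Update step: each centre becomes the mean of its assigned points.
new_centres = {}
for name in centres:
    members = [points[i] for i, c in assignments.items() if c == name]
    new_centres[name] = (
        sum(p[0] for p in members) / len(members),
        sum(p[1] for p in members) / len(members),
    )

print(assignments)
print(new_centres)
```

With these initial centres, points 1 to 5 fall in cluster 1 and points 6 to 10 in cluster 2 after the first iteration.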

https://lh5.googleusercontent.com/kyhKlQh1YUGccCDMSPCQr1lplWKli0qf6YDG5gH0d_pEEGAbf1MQxqOuCSc2F95Wg6h8JnCxfkXLsTgavIZ5El-ac6kh0OJoPZS82uSnV0YPHzNTfrbQYlpn0ZKH3y2l8qTmQCMG






Question 4. Answer the following questions. [5 Marks]   

Students in a particular class are graded in subjects A, B and C out of 10 points. Based on the information provided in the table below for 8 students, predict using KNN algorithm approach if a student who scored the following grades  A 5; B 7; C 6 will pass or fail?

  1. When K = 3?

Score in A   Score in B   Score in C   Result
9            5            7            Pass
7            3            6            Fail
5            8            9            Pass
8            6            7            Pass
4            7            8            Fail
6            7            6            Pass
6            8            5            Fail
5            6            5            Fail
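A minimal sketch of the KNN prediction for the query student (A 5, B 7, C 6) with K = 3, assuming Euclidean distance and a simple majority vote (the question does not name a distance measure, so that choice is an assumption).

```python
import math
from collections import Counter

# Training data transcribed from the table: ((A, B, C), result).
training = [
    ((9, 5, 7), "Pass"), ((7, 3, 6), "Fail"), ((5, 8, 9), "Pass"),
    ((8, 6, 7), "Pass"), ((4, 7, 8), "Fail"), ((6, 7, 6), "Pass"),
    ((6, 8, 5), "Fail"), ((5, 6, 5), "Fail"),
]

def knn_predict(query, data, k):
    """Return the majority label among the k nearest rows, plus those rows."""
    nearest = sorted(data, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], nearest

prediction, neighbours = knn_predict((5, 7, 6), training, k=3)
print(neighbours)
print(prediction)
```

The three nearest neighbours are (6, 7, 6) Pass, (5, 6, 5) Fail and (6, 8, 5) Fail, so the majority vote is Fail.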


Question 5. Answer the following questions. [7 Marks]   


  1. Find the equation of the separating hyperplane for the points below using the linear Support Vector Machine method. 

Positive Points: {(3, 2), (4, 3), (2, 3), (3, -1)}

Negative Points: {(1, 0), (-1, -3), (0, 2), (-1, 2)}
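A hand-derived candidate solution can be checked against the hard-margin SVM conditions. The candidate below (w = (8/7, 2/7), b = -15/7, i.e. the line 4·x1 + x2 = 7.5) and the multipliers are this sketch's own working, not part of the exam paper. The code verifies that every point has functional margin at least 1, that the margin is exactly 1 on the presumed support vectors (2, 3), (3, -1) and (1, 0), and that w is a non-negative combination of those support vectors with the multipliers summing to zero across classes.

```python
from fractions import Fraction as F

positives = [(3, 2), (4, 3), (2, 3), (3, -1)]
negatives = [(1, 0), (-1, -3), (0, 2), (-1, 2)]

# Candidate hyperplane (hand-derived assumption): w . x + b = 0.
w = (F(8, 7), F(2, 7))
b = F(-15, 7)

def functional_margin(point, label):
    return label * (w[0] * point[0] + w[1] * point[1] + b)

margins = {p: functional_margin(p, +1) for p in positives}
margins.update({p: functional_margin(p, -1) for p in negatives})

# Points that lie exactly on the margin are the support vectors.
support_vectors = sorted(p for p, m in margins.items() if m == 1)

# Candidate dual multipliers (assumption) for the stationarity check:
# w = sum(alpha_i * y_i * x_i) and sum(alpha_i * y_i) = 0.
alphas = {(2, 3): (F(12, 49), +1), (3, -1): (F(22, 49), +1), (1, 0): (F(34, 49), -1)}
w_from_dual = tuple(
    sum(a * y * p[i] for p, (a, y) in alphas.items()) for i in range(2)
)
balance = sum(a * y for a, y in alphas.values())

print(support_vectors, min(margins.values()), w_from_dual, balance)
```

Exact fractions are used so that the equality checks are not affected by floating-point rounding.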






---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

BITS-WILP-Machine Learning - ML - Mid Semester - 2019-2020

Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division

Second Semester 2019-2020

M.Tech (Data Science and Engineering)

Mid-Semester Test (EC-2 Regular)



Course No.         :  DSECL ZG565

Course Title         : MACHINE LEARNING  

Nature of Exam     : Closed Book 

Weightage         : 30% 

Duration         : 90 minutes  

Date of Exam         : December 29, 2019 (FN)               

Note: 

  1. Please follow all the Instructions to Candidates given on the cover page of the answer book.

  2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.  

  3. Assumptions made if any, should be stated clearly at the beginning of your answer.

 

Answer All the Questions (only on the pages mentioned against questions. if you need more pages, continue remaining answers from page 20 onwards)       

       

Question 1. [Marks 2+3=5]                   [to be answered only on pages 3-5]

a) What are the steps in designing a machine learning system? (2 marks)


b) A survey of 200 families was conducted to observe the relationship between annual income and whether the family will buy a car or not.  Consider the following table:


            Income below Rs 10 lakhs   Income >= Rs 10 lakhs   Total
Buyer       38                         42                      80
Non-Buyer   82                         38                      120
Total       120                        80                      200


  1. What is the probability that a randomly selected family is a buyer?            (1 mark)

  2. What is the probability that a randomly selected family is both a buyer of the car and has an income of Rs 10 lakhs and above?                            (1 mark)

  3. A family selected at random belongs to the category of income of Rs 10 lakhs and above. What is the probability that they will buy a car?                            (1 mark)
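The three probabilities follow directly from the counts in the table (marginal, joint and conditional). A short sketch using exact fractions, with illustrative variable names:

```python
from fractions import Fraction

# Counts from the survey table.
buyers_low, buyers_high = 38, 42
non_buyers_low, non_buyers_high = 82, 38
total = buyers_low + buyers_high + non_buyers_low + non_buyers_high

# 1. Marginal probability of being a buyer.
p_buyer = Fraction(buyers_low + buyers_high, total)

# 2. Joint probability: buyer AND income >= Rs 10 lakhs.
p_buyer_and_high = Fraction(buyers_high, total)

# 3. Conditional probability: buyer GIVEN income >= Rs 10 lakhs.
p_buyer_given_high = Fraction(buyers_high, buyers_high + non_buyers_high)

print(float(p_buyer), float(p_buyer_and_high), float(p_buyer_given_high))
```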

Question 2. [Marks =5]                        [to be answered only on pages 6-7]

Consider two bags A and B, where A contains 5 white balls and 7 blue balls, whereas B contains 2 white and 12 blue balls. Bag A is picked 50% of the time (and bag B otherwise). In one experiment, a white ball is drawn. What is the probability that the ball was drawn from bag B?            (5 marks)
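This is a direct application of Bayes' theorem: P(B | white) = P(white | B)·P(B) / [P(white | A)·P(A) + P(white | B)·P(B)]. A sketch with exact fractions, assuming bag B is picked whenever bag A is not:

```python
from fractions import Fraction

# Priors: bag A is picked 50% of the time; bag B otherwise (assumption).
p_a = Fraction(1, 2)
p_b = 1 - p_a

# Likelihoods of drawing a white ball from each bag.
p_white_given_a = Fraction(5, 5 + 7)
p_white_given_b = Fraction(2, 2 + 12)

# Total probability of drawing a white ball.
p_white = p_white_given_a * p_a + p_white_given_b * p_b

# Posterior probability that the white ball came from bag B.
p_b_given_white = p_white_given_b * p_b / p_white

print(p_b_given_white, float(p_b_given_white))
```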


Question 3. [Marks=5]               [to be answered only on pages 8-9]

Given the following labelled training data, 


Flat 20% Cashback on Oyo Room bookings done via Paytm. (SPAM)

Lets Talk Fashion! Get flat 40% Cashback on Backpacks (SPAM)

Opportunity with Product firm for Fullstack (HAM)

Javascript Developer, Full Stack Developer in Bangalore (HAM)


Use a Naive Bayes classifier with Laplace smoothing to classify the sentence “Scan Paytm QR Code to Pay & Win 100% Cashback”.
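A minimal multinomial Naive Bayes sketch for this question. The tokenization (lowercasing, keeping letters, digits and the percent sign), add-one smoothing over the training vocabulary, and the treatment of unseen query words as zero-count words are all assumptions; a hand-worked answer may tokenize differently and should state its own choices.

```python
import math
import re
from collections import Counter

training = [
    ("Flat 20% Cashback on Oyo Room bookings done via Paytm.", "SPAM"),
    ("Lets Talk Fashion! Get flat 40% Cashback on Backpacks", "SPAM"),
    ("Opportunity with Product firm for Fullstack", "HAM"),
    ("Javascript Developer, Full Stack Developer in Bangalore", "HAM"),
]

def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits and '%'.
    return re.findall(r"[a-z0-9%]+", text.lower())

word_counts = {"SPAM": Counter(), "HAM": Counter()}
doc_counts = Counter()
for text, label in training:
    word_counts[label].update(tokenize(text))
    doc_counts[label] += 1

vocabulary = set(word_counts["SPAM"]) | set(word_counts["HAM"])

def log_posterior(text, label, alpha=1):
    """Unnormalized log P(label | text) with add-alpha (Laplace) smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for token in tokenize(text):
        count = word_counts[label][token]  # 0 for unseen words
        score += math.log((count + alpha) / (total + alpha * len(vocabulary)))
    return score

query = "Scan Paytm QR Code to Pay & Win 100% Cashback"
scores = {label: log_posterior(query, label) for label in word_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Only "paytm" and "cashback" occur in the training data, both in SPAM messages, so the smoothed likelihood favours SPAM under these assumptions.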


Question 4. [Marks 2+3=5]                    [to be answered only on pages 10-11]

  1. Explain the cost/error function used in logistic regression         (2 marks)

  2. Compare Probabilistic generative model and probabilistic discriminative models with examples.                                     (3 marks)

Question 5. [Marks 3+2=5]                    [to be answered only on pages 12-14]

  1. Plot cost function J (w) for linear regression y=w1x for the training data pairs <0, 0>, <0.5, 0.5>, <1, 1>, <1.5, 1.6>                    (3 marks)

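The points needed for the plot of J(w) can be tabulated as below. The sketch assumes the common squared-error convention J(w1) = (1/2m)·Σ(w1·x − y)², which the question does not spell out; a different scaling changes the heights but not the shape or the location of the minimum.

```python
# Training pairs (x, y) from the question.
data = [(0, 0), (0.5, 0.5), (1, 1), (1.5, 1.6)]

def cost(w1):
    """Assumed cost: J(w1) = 1/(2m) * sum((w1*x - y)^2)."""
    m = len(data)
    return sum((w1 * x - y) ** 2 for x, y in data) / (2 * m)

# Tabulate J(w1) on a grid of w1 values from 0 to 2 for plotting.
curve = [(i / 10, cost(i / 10)) for i in range(0, 21)]

# Closed-form minimizer for a line through the origin: sum(x*y) / sum(x^2).
w_best = sum(x * y for x, y in data) / sum(x * x for x, y in data)

for w1, j in curve:
    print(f"{w1:.1f}  {j:.5f}")
print("minimum near w1 =", round(w_best, 4))
```

The curve is a parabola in w1 with its minimum slightly above w1 = 1.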

  1. Distinguish between bias and variance in the machine learning domain and discuss how these two are affected by model complexity.                (2 marks)


Question 6. [Marks 2+1+2= 5]                 [to be answered only on pages 15-16]

Provide answers based on the following set of training examples

Instance a1 a2 Classification
1 T T +
2 T T +
3 T F -
4 F F +
5 F T -
6 F T -


  1. What is the entropy of this collection of training examples with respect to the target function classification?                                (2 marks)

  2. What is the information gain of a2 relative to these training examples?    (1 mark)

  3. Why do we prefer shorter/smaller trees while learning a decision tree? Does ID3 guarantee a shorter tree?                                 (2 marks)
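The entropy and information-gain computations for the first two parts can be sketched as follows, using base-2 logarithms. Function and variable names are illustrative.

```python
import math
from collections import Counter

# Training examples transcribed from the table: (a1, a2, classification).
examples = [
    ("T", "T", "+"), ("T", "T", "+"), ("T", "F", "-"),
    ("F", "F", "+"), ("F", "T", "-"), ("F", "T", "-"),
]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy of the full set minus the weighted entropy after the split."""
    gain = entropy([r[-1] for r in rows])
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

collection_entropy = entropy([r[-1] for r in examples])
gain_a1 = information_gain(examples, 0)
gain_a2 = information_gain(examples, 1)
print(collection_entropy, round(gain_a1, 4), round(gain_a2, 4))
```

With three positive and three negative examples the collection entropy is 1 bit, and splitting on a2 leaves both branches evenly mixed, so its information gain is 0.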



---------------------------------------------------------------------------- 

Saturday, July 2, 2022

BITS - ISM - Important Links -Mid semester

Some links that might be useful for online calculation (courtesy of a WhatsApp member)

Confidence Interval Calculator:
https://www.calculator.net/confidence-interval-calculator.html

Confidence interval calculator with finite population
https://select-statistics.co.uk/calculators/confidence-interval-calculator-population-mean/

Population Correction Calculator
https://select-statistics.co.uk/calculators/confidence-interval-calculator-population-mean/

Z Score calculator:
https://www.calculator.net/z-score-calculator.html



---------------------------------------------------------------------------- 

Thursday, June 30, 2022

BITS-WILP-IDS-Makeup 2021 - Mid semester with Solutions

Birla Institute of Technology and Science, Pilani
Work Integrated Learning Programmes Division
BITS WILP Programme - M. Tech. in Data Science and Engg
I Semester 2020-2021
Mid-Semester Test
(EC2 - Makeup)
Course Number DSECL ZG523
Course Name Introduction to Data Science
Nature of Exam Open Book
Weight-age for grading 30
Duration 1.5 hrs
Date of Exam
# Pages 2
# Questions 5
Instructions
1. All questions are compulsory.
2. Questions are to be answered in the order in which they appear in this paper.
3. All answers must be directed to the question in short and simple paragraphs or bullet points;
use visuals/diagrams wherever necessary.
4. Assumptions made if any, should be stated clearly at the beginning of your answer.


1. Challenge the accuracy of the below statements (agree/disagree) with short justification.
[Answer should contain few bullet points of no more than 2 – 4 lines]. [5]
(a) For Data Collection and Pre-processing, domain Knowledge is essential, whereas for
Model Building and Evaluation it is optional. [2.5]
• Agree [0.5]
• Data Pre-processing (cleansing, outlier detection, imputation, feature engineering,
etc.) expects domain expertise (in the area of the Data/Business) so that Data
Scientist (or Domain expert) can identify relevant/irrelevant features, right techniques
for Data Imputation and take decisions regarding the validity of the source
data by performing EDA, etc. [1]
• Model building is purely technical activity where the focus is exclusively on the
choice of the model, tuning the parameters and evaluating the performance, etc.
using various tools. [1]
(b) In supervised learning, we do not need a separate Validation-set when there already
exists a Test-set. [2.5]
• Disagree [0.5]
• Validation-set is used for internal assessment of Model performance during development
and for hyper-parameter tuning. [1]
• Test-set is exclusively reserved for Testing the Model Performance on ‘un-seen’
data (kept locked away, not accessible to Model builder, similar to ‘Acceptance
Test’ in software engineering). [1]


2. You have started your spring cleaning activity. On day 10 you have decided that you
will clean your email inbox. During this cleaning, you noticed that you are paying around
Rs.3000/- per month against your internet bill. You want to reduce the bill amount. Assume
relevant data and state the assumptions clearly at the beginning of your answer. [6]
(a) Identify the type of data analytics that you will use if you observed that the internet
usage has increased because of the pandemic. [1]
(b) Identify the type of data analytics that you will use if you compared the various bills
and then switched to a different plan that better suits your need. [1]
(c) Write any 4 questions that you will ask yourself, in the step 2. [4]
Solution:
(a) Descriptive analytics / EDA / exploratory [1]
(b) Prescriptive analytics [1]
(c) Any 4 questions suitable for Prescriptive analytics. [4]
i. Do you have any related requirements? If yes, what are those?
ii. What are the existing/related systems within the capability that capture/use related
information?
iii. What are the gaps?
iv. Who are the stakeholders?
v. Who will be affected by this implementation?
vi. Any Licenses/Commercials needed in case of proprietary solutions?
vii. Are there any other dependencies?
viii. Do you see any other problems/challenges?
ix. Does the client have a technology preference?
x. Does the client have limited / unlimited infrastructure?
xi. How will we measure the business impact of the final model? How will we justify
the project expense against the benefits?
xii. Is there a plan for domain expert validation?
xiii. If so, will the model be in a form that they can understand?
xiv. Is there an evaluation metric set up by the business?
xv. Is there a hold-out data (i.e. data used for validation) available?
xvi. Against what baseline or benchmark the results are compared?
xvii. Are the business costs and benefits considered into account?


3. Consider the below data set for carrying out some data science activity. There are five
columns identified by f1, f2, f3, f4, f5 ("f" means feature). In the table "NA" means "Not
Available" (or missing value). Your task as a data engineer is to obtain the missing values
using some imputation technique. The domain expert gave advice that f3 can be imputed
using mean, f4 can be imputed using median and f5 can be imputed using mean based on
the feature f2. Write the steps to be followed to obtain the missing values and re-write the
dataset with the imputed values. [6]

row f1 f2 f3 f4 f5
1 25 A 5 3 37
2 27 B NA 5 38
3 28 A 17 NA 40
4 24 NA 6 4 43
5 29 C NA 6 36
6 22 C 17 5 NA
7 23 A 5 5 NA
Solution:
(a) f2: Categorical data. Hence mode = A is used to impute row 4. [1]
(b) f3: mean of the observed values 5, 17, 6, 17, 5 is 50/5 = 10. Hence, rows 2 and 5 are imputed with 10.
[1]
(c) f4: median = 5. Hence impute row 3 with median 5. [1]
(d) f5: row6 = 36 as there is only one value for C. row7 = mean of 37,40,43 = 40. [1*2=2]
(e) [1]
row f1 f2 f3 f4 f5
1 25 A 5 3 37
2 27 B 10 5 38
3 28 A 17 5 40
4 24 A 6 4 43
5 29 C 10 6 36
6 22 C 17 5 36
7 23 A 5 5 40
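The imputation steps can be sketched in plain Python as below: mode for the categorical f2, mean of the observed values for f3, median for f4, and the per-group mean of f5 within each f2 category (computed after f2 has been filled in). Note that the mean over all five observed f3 values (5, 17, 6, 17, 5) is 10. The dictionary layout is illustrative, with None standing for NA.

```python
from statistics import mean, median, mode

rows = [
    {"row": 1, "f1": 25, "f2": "A", "f3": 5, "f4": 3, "f5": 37},
    {"row": 2, "f1": 27, "f2": "B", "f3": None, "f4": 5, "f5": 38},
    {"row": 3, "f1": 28, "f2": "A", "f3": 17, "f4": None, "f5": 40},
    {"row": 4, "f1": 24, "f2": None, "f3": 6, "f4": 4, "f5": 43},
    {"row": 5, "f1": 29, "f2": "C", "f3": None, "f4": 6, "f5": 36},
    {"row": 6, "f1": 22, "f2": "C", "f3": 17, "f4": 5, "f5": None},
    {"row": 7, "f1": 23, "f2": "A", "f3": 5, "f4": 5, "f5": None},
]

def observed(column):
    return [r[column] for r in rows if r[column] is not None]

# Step 1: f2 is categorical, so fill it with the mode.
f2_fill = mode(observed("f2"))
# Step 2: f3 with the mean and f4 with the median of the observed values.
f3_fill = mean(observed("f3"))
f4_fill = median(observed("f4"))

for r in rows:
    if r["f2"] is None:
        r["f2"] = f2_fill
    if r["f3"] is None:
        r["f3"] = f3_fill
    if r["f4"] is None:
        r["f4"] = f4_fill

# Step 3: f5 with the mean of observed f5 values in the same f2 group.
group_means = {
    g: mean(r["f5"] for r in rows if r["f2"] == g and r["f5"] is not None)
    for g in {r["f2"] for r in rows}
}
for r in rows:
    if r["f5"] is None:
        r["f5"] = group_means[r["f2"]]

for r in rows:
    print(r["row"], r["f1"], r["f2"], r["f3"], r["f4"], r["f5"])
```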


4. A motor insurance company has an investigation team that investigates the claims
made. The fraud investigation team investigates up to 30% of all claims made. The company
realised that it was losing too much money due to fraudulent claims. List four ways
in which predictive data analytics could be used to help address this business problem. For
each proposed approach, describe the predictive model that will be built, how the model
will be used by the business, and how using the model will help address the original business
problem. [8]
Solution:
(a) Claim prediction – Predict the likelihood that an insurance claim is fraudulent. Assign
every newly arising claim a fraud likelihood. Those that are most likely to be fraudulent
could be flagged for investigation by the investigation team. This would increase the number of
fraudulent claims detected and reduce the amount of money lost to fraud.
(b) Member prediction – Predict the propensity of a member to commit fraud. Run this
model every quarter to identify those members most likely to commit fraud. The
company could take a risk mitigation action ranging from contacting the
member with some kind of warning to canceling the member’s policies. By identifying
members likely to make fraudulent claims before they make them, the company could
save significant money.
(c) Application prediction – Predict the likelihood that a policy someone has applied for
will ultimately result in a fraudulent claim. This model could be run for every new
application, and those applications that are predicted likely to result in a fraudulent
claim could be rejected.
(d) Payment prediction – Predict the amount most likely to be paid out by the company
after investigating the claim. Run this model whenever new claims are made. The
policy holder could be offered the amount predicted by the model as settlement, as
an alternative to going through a claims investigation process. The company could
save on claims investigations and reduce the amount of money paid out on fraudulent
claims.


5. While tuning parameters for a classification algorithm, three different confusion matrices
were produced. For parameter p1, the confusion matrix is c1. For parameter p2, the
confusion matrix is c2. For parameter p3, the confusion matrix is c3. Explain how to draw
the ROC curve and write the necessary equations. Using the ROC curve, determine the best
parameter among p1, p2, and p3. Show each of the steps clearly. [5]
Confusion Matrix c1
             Predicted: No   Predicted: Yes
Actual: No   500             100
Actual: Yes  50              1000
Confusion Matrix c2
             Predicted: No   Predicted: Yes
Actual: No   540             60
Actual: Yes  250             800
Confusion Matrix c3
             Predicted: No   Predicted: Yes
Actual: No   570             30
Actual: Yes  450             600
Solution:
For C1 : [0.5 + 0.5 marks]
TPR = 1000/1050 = 0.95
FPR = 100/600 = 0.17
For C2 : [0.5 + 0.5 marks]
TPR = 800/1050 = 0.76
FPR = 60/600 = 0.10
For C3 : [0.5 + 0.5 marks]
TPR = 600/1050 = 0.57
FPR = 30/600 = 0.05

The red line is the reference line.
The point (0.17, 0.95) is far away from the red line.
Hence, p1 is the best parameter.
1 mark for the graph. 1 mark for the explanation.
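The ROC points can be computed from the confusion matrices with TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The sketch below also ranks the parameters by TPR − FPR, the vertical distance above the diagonal reference line; using that particular distance as the selection criterion is an assumption made here to mirror the "far away from the reference line" argument.

```python
# Confusion matrix entries per parameter, read from the tables above.
matrices = {
    "p1": {"tn": 500, "fp": 100, "fn": 50, "tp": 1000},
    "p2": {"tn": 540, "fp": 60, "fn": 250, "tp": 800},
    "p3": {"tn": 570, "fp": 30, "fn": 450, "tp": 600},
}

def roc_point(m):
    """Return (FPR, TPR) for one confusion matrix."""
    tpr = m["tp"] / (m["tp"] + m["fn"])
    fpr = m["fp"] / (m["fp"] + m["tn"])
    return fpr, tpr

roc_points = {name: roc_point(m) for name, m in matrices.items()}

# Assumed criterion: largest vertical distance above the diagonal TPR = FPR.
best = max(roc_points, key=lambda n: roc_points[n][1] - roc_points[n][0])

for name, (fpr, tpr) in roc_points.items():
    print(name, "FPR =", round(fpr, 2), "TPR =", round(tpr, 2))
print("best parameter:", best)
```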











---------------------------------------------------------------------------- 