Showing posts with label Question. Show all posts

Wednesday, February 8, 2023

Midsemester - Regular - SPA - Question Paper with answers - Jan 2023

 

Birla Institute of Technology & Science, Pilani

Work Integrated Learning Programmes Division

First Semester 2022-2023

 

Mid-Semester Test

(EC-2 Regular – ANSWER KEY)

 

Course No.                  : DSECL ZC556

Course Title                : Stream Processing and Analytics

Nature of Exam           : Open  Book

No. of Pages        =  4

No. of Questions =  5

Weightage                   : 30%

Duration                      : 2 Hours 

Date of Exam              : 06/03/2021 or 19/03/2021  (FN/AN)

Note to Students:

1.       Please follow all the Instructions to Candidates given on the cover page of the answer book.

2.       All parts of a question should be answered consecutively. Each answer should start from a fresh page. 

3.       Assumptions made if any, should be stated clearly at the beginning of your answer.

 

Q1. Consider an Apache ZooKeeper ensemble set up with different numbers of servers. Briefly discuss the consequences (if any) of each of the following ensemble configurations. [1 + 1 + 1 = 3]

a)      Single server

b)      Two servers

c)      Odd number of servers (more than one)

Answer:

a)      Single server – single point of failure

b)      Two servers – the state is replicated, but a majority quorum requires both servers to agree, so consensus is hard to maintain and the ensemble cannot make progress if either server fails

c)      Odd number of servers – more fault tolerance, and consensus is easier because a majority quorum is always well defined with an odd number (2f + 1 servers tolerate f failures), but increasing the number of servers has an adverse impact on write performance
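The quorum arithmetic behind these three cases can be sketched in a few lines. This is only an illustration of majority quorums, not ZooKeeper code; the function names are made up for the example.

```python
def quorum_size(n_servers: int) -> int:
    """Smallest strict majority of an ensemble with n_servers members."""
    return n_servers // 2 + 1


def tolerated_failures(n_servers: int) -> int:
    """Servers that can fail while a majority can still be formed."""
    return n_servers - quorum_size(n_servers)


# Columns: ensemble size, quorum size, failures tolerated.
for n in (1, 2, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Note that an even-sized ensemble tolerates no more failures than the odd-sized ensemble one server smaller, which is why odd sizes are preferred.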

Q2. Imagine that you are building a next-generation telecom company, one that helps users avoid wasting data from their existing data package subscription. The first step is to deploy a smart agent on each customer’s phone. These agents will report the phone’s application data usage every hour, along with information on which applications are consuming the most data. Based on this, we would like to offer customers variable pricing in real time, based on when and how they run their favorite applications. [1 + 2 + 1 + 1 + 1 = 6]

a)      Show a sample of the tuple generated out of the agent on phone.

b)      Give two examples of exploratory analysis you would do on this data, with details.

c)      What can be the target variable in this example?

d)      What sort of data modelling technique can be utilized for it?

e)      Which type of system architecture will be useful in this scenario?

Answer:

a)      {Timestamp, phone_no, {app1:data_usage, app2:data_usage, app3:data_usage,…}}

b)      Each application's hourly / daily / day-wise (weekday/weekend) data usage

Day-wise data requirements for a device

Categorizing data usage by type of application – entertainment, work related

c)      Recommended plan or subscription value

d)      Classification – for plan recommendation; Regression – for plan value

e)      Lambda – Both current and historical data taken into consideration for plan recommendation
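As a rough illustration of parts (a) and (b), the sketch below builds a few agent tuples in the shape given in the answer and aggregates usage per application by weekday/weekend. The phone number, application names and megabyte values are hypothetical, as is the helper name `usage_by_app_and_daytype`.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical hourly reports: {timestamp, phone_no, {app: data_usage}}.
reports = [
    {"timestamp": "2023-01-21T09:00:00", "phone_no": "9000000001",
     "usage_mb": {"video": 120.0, "mail": 5.0}},
    {"timestamp": "2023-01-21T10:00:00", "phone_no": "9000000001",
     "usage_mb": {"video": 80.0}},
    {"timestamp": "2023-01-24T09:00:00", "phone_no": "9000000001",
     "usage_mb": {"video": 10.0, "mail": 30.0}},
]


def usage_by_app_and_daytype(reports):
    """Total usage per (application, weekday/weekend) pair."""
    totals = defaultdict(float)
    for report in reports:
        day = datetime.fromisoformat(report["timestamp"]).weekday()
        day_type = "weekend" if day >= 5 else "weekday"  # 5 = Sat, 6 = Sun
        for app, mb in report["usage_mb"].items():
            totals[(app, day_type)] += mb
    return dict(totals)


totals = usage_by_app_and_daytype(reports)
for key in sorted(totals):
    print(key, totals[key])
```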

 

Q3. A streaming data application is built to process the events received from different data sources. The processed data needs to be made available to various clients for further use, and it should remain available for six months after it is processed. The process of storing the data should not have any impact on the analysis layer. Data persistence can be handled in batch mode. Propose a suitable storage solution that meets all of the above requirements for data storage. [6]

Answer:



Identifying the long term storage mechanism – 1 mark

Decoupling between Analysis layer and Batch layer – 2 marks

Writing the messages between the batch layer and long term storage – 2 marks

Drawing the overall architecture diagram – 1 mark

 

 

 

Q4. Answer the following questions in brief:                                                                   [2 *3 = 6]

 

a)      Why is a messaging or data flow layer required in streaming data systems?

Answer:

To provide the required delivery semantics, such as at-least-once, in streaming data systems

To buffer incoming data so that the processing layer is protected from a huge processing backlog

b)      List four common data processing operations that can be performed on the streaming data.

Answer:

·         Cleaning

·         Filter

·         Join

·         Grouping
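The four operations can be sketched as a small generator pipeline. This is plain Python rather than any particular streaming framework, and the event fields (`sensor`, `value`), the lookup table and the function names are all hypothetical.

```python
from collections import defaultdict


def clean(events):
    """Cleaning: drop malformed events and normalise the sensor id."""
    for event in events:
        if event.get("sensor") and event.get("value") is not None:
            yield {**event, "sensor": event["sensor"].strip().lower()}


def keep_at_least(events, threshold):
    """Filter: keep only events whose value reaches the threshold."""
    return (event for event in events if event["value"] >= threshold)


def join_city(events, cities):
    """Join: enrich each event from a static lookup table."""
    for event in events:
        yield {**event, "city": cities.get(event["sensor"], "unknown")}


def group_sum(events, key):
    """Grouping: sum the values per key."""
    totals = defaultdict(float)
    for event in events:
        totals[event[key]] += event["value"]
    return dict(totals)


events = [
    {"sensor": " S1 ", "value": 10.0},
    {"sensor": "s2", "value": 3.0},
    {"sensor": None, "value": 7.0},
    {"sensor": "s1", "value": 6.0},
    {"sensor": "s3", "value": None},
]
cities = {"s1": "Pilani", "s2": "Goa"}

result = group_sum(join_city(keep_at_least(clean(events), 5.0), cities), "city")
print(result)
```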

 

c)      List two points that motivate the need for systems like Apache ZooKeeper in distributed environments.

Answer:

Managing the metadata of the cluster

Maintaining the distributed state of the cluster

 

Q5. Consider the following Kafka Cluster description.                                      [3 + 1 + 1 + 1 = 6]

·         4 node cluster

·         Name of the Topic: Bus

·         Number of Partitions: 3

·         Replication factor of ‘Bus’: 2

·         3 producers

·         2 consumers

a)      Draw Kafka’s architecture clearly highlighting the following in a block diagram

-producers, consumers, broker, topic and partitions.

Answer:



Producers = 0.5

Consumers = 0.5

Broker = 1

Topic , partition , replica = 1

 

b)      How many consumer groups can be created for this configuration?

Answer:

There can be an indefinite number of consumer groups working simultaneously.

 

c)      What is the maximum number of consumers that each consumer group can have while ensuring maximum parallelism?

Answer:

3 – Each reading from a different partition
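Why three is the useful maximum can be seen from a simple round-robin assignment of partitions to the consumers in one group. This is a simplified sketch and not Kafka's actual partition assignor; the consumer names and the `assign_partitions` helper are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Round-robin: each partition goes to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = ["Bus-0", "Bus-1", "Bus-2"]

two = assign_partitions(partitions, ["c1", "c2"])
three = assign_partitions(partitions, ["c1", "c2", "c3"])
four = assign_partitions(partitions, ["c1", "c2", "c3", "c4"])

print(two)
print(three)
print(four)  # the fourth consumer receives no partition and stays idle
```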

 

d)      What is the maximum number of server failures that this setup can handle?

Answer:

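3 in the best case, which requires the single surviving broker to hold a replica of every partition. With a replication factor of 2, only 1 broker failure is guaranteed to be tolerated without losing a partition.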

 

Q6.  Consider the use case of a real-time traffic and routing service for vehicles on the road. The traffic and navigation service is expected to handle requests from a large number of cars. [1+1+1=3]

a)         What will be the appropriate interaction pattern in the collection tier?

b)         Suggest message delivery semantics for this problem.

c)         List out storage requirements with respect to analytics tier.

 

Answer:

a)         The suitable interaction pattern is request/response with asynchronous handling: each car requests traffic and alternate-route information, and the navigation service must serve many such requests simultaneously. The ability to request data asynchronously and process it as it arrives is critical in this use case.

b)         The message delivery semantics can be at-most-once because:

If communication between the collection tier and the analysis tier is interrupted for a long time, the only consequence is a delay in service to the drivers; there is no risk to life when the system fails, so occasional message loss is tolerable. Also, only a small chunk of data needs to be stored each day, since only locality-specific data is under consideration.

c)         Storage solution: data can be stored and refreshed at periodic intervals, so message queues would suffice instead of long-term storage for further analysis. Analysis of congested areas, locations and timestamps can be used to develop predictive models for later use, for which recent past data would be sufficient.

 

********************




---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Tuesday, January 24, 2023

Midsemester - Information Retrieval -- DSECLZG537 - Jan 7th 2023

Information Retrieval - Regular - Mid Semester conducted on 7th Jan 2023









---------------------------------------------------------------------------- 

Sunday, January 22, 2023

Midsemester - Regular - SPA - Question Paper

BITS - Mid Semester - SPA - Regular - 21st Jan 2023







---------------------------------------------------------------------------- 

Sunday, January 1, 2023

Information Retrieval -- DSECLZG537 - Mid Semester Question Paper - June 2021

 

Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division

June 2021

Mid-Semester Test

(EC-1 Regular)

No. of Pages        = 2
No. of Questions = 2

 


Course No.                   : SS ZG537  

Course Title                  : INFORMATION RETRIEVAL  

Nature of Exam            : Closed Book

Weightage                    : 30%

 

Note:

1.       Please follow all the Instructions to Candidates given on the cover page of the answer book.

2.       All parts of a question should be answered consecutively. Each answer should start from a fresh page. 

3.       Assumptions made if any, should be stated clearly at the beginning of your answer.

 

Q1 – 2+5+3+5=15 Marks

A) Give an example of uncertainty and vagueness issues in Information retrieval [2 Marks]               

 

B) Explain the merge algorithm for the query “Information Retrieval”. What is the best order of query processing for the query “BITS AND Information AND Retrieval”? Which of the 15 documents will be returned as output? [5 Marks]



 


Solution:

Merge Algorithm - Intersecting two posting lists : Algorithm


Output document - 11
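The posting lists for this question are not reproduced here, so the sketch below uses hypothetical lists purely to illustrate the technique: a linear merge of two sorted posting lists, and processing an AND query in order of increasing posting-list length so that intermediate results stay small.

```python
def intersect(p1, p2):
    """Linear merge of two sorted posting lists (lists of doc ids)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings):
    """AND over several terms, shortest posting list first."""
    ordered = sorted(postings.values(), key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        result = intersect(result, plist)
    return result


# Hypothetical posting lists, not the ones from the exam paper.
postings = {
    "bits": [1, 4, 7, 9, 12],
    "information": [2, 4, 5, 9, 12, 14],
    "retrieval": [4, 12],
}
print(intersect_many(postings))
```

With these lists the processing order is retrieval, then bits, then information.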

 

C) [3 Marks]

 

D) Build inverted index using Blocked sort-based Indexing for 50 million records. Explain the algorithm in detail with respect to indexing 50 million records.                            [5 Marks]
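A minimal in-memory sketch of the blocked sort-based indexing idea follows: split the stream of (term, docID) pairs into blocks that fit in memory, sort each block, then merge the sorted runs into postings. It simplifies in two ways that should be read as assumptions: terms are used directly instead of termIDs, and sorted blocks are kept in a list where a real run over 50 million records would write each block to disk and merge the files.

```python
import heapq
from itertools import groupby


def make_blocks(pairs, block_size):
    """Cut the (term, doc_id) stream into fixed-size blocks."""
    block = []
    for pair in pairs:
        block.append(pair)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block


def bsbi_index(pairs, block_size):
    # Step 1: sort each block (a real run would write it to disk here).
    runs = [sorted(set(block)) for block in make_blocks(pairs, block_size)]
    # Step 2: k-way merge of the sorted runs.
    merged = heapq.merge(*runs)
    # Step 3: collect doc ids per term, skipping duplicates across blocks.
    index = {}
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        postings = []
        for _, doc_id in group:
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
        index[term] = postings
    return index


pairs = [("bits", 1), ("pilani", 1), ("bits", 2), ("mtech", 2),
         ("pilani", 3), ("bits", 3), ("bits", 1)]
index = bsbi_index(pairs, block_size=3)
print(index)
```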

 

 

Q2 – 5+5+5=15 Marks

A)    Assume a corpus of 10000 documents.  The following table gives the TF and DF values for the 3 terms in the corpus of documents. Calculate the logarithmic TF-IDF values.                                                                                                           [5 Marks]

 

Term      Doc1   Doc2   Doc3
bits      15     5      20
pilani    2      20     0
mtech     0      20     15

Term      df_t
bits      2000
pilani    1500
mtech     500
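One way to set up the calculation, assuming the common weighting w = (1 + log10 tf) × log10(N / df) with w = 0 when tf = 0. The question only says "logarithmic TF-IDF", so the exact variant and the base-10 logarithm are assumptions.

```python
import math

N = 10_000  # documents in the corpus

tf = {
    "bits": {"Doc1": 15, "Doc2": 5, "Doc3": 20},
    "pilani": {"Doc1": 2, "Doc2": 20, "Doc3": 0},
    "mtech": {"Doc1": 0, "Doc2": 20, "Doc3": 15},
}
df = {"bits": 2000, "pilani": 1500, "mtech": 500}


def log_tf(count):
    """Logarithmic term frequency; zero when the term is absent."""
    return 1 + math.log10(count) if count > 0 else 0.0


def idf(term):
    """Inverse document frequency of a term."""
    return math.log10(N / df[term])


weights = {
    term: {doc: log_tf(count) * idf(term) for doc, count in docs.items()}
    for term, docs in tf.items()
}

for term, docs in weights.items():
    print(term, {doc: round(w, 3) for doc, w in docs.items()})
```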

 

 

 

 

 

B) Classify the test document d6 into c1 or c2 using a Naïve Bayes classifier. The documents in the training set and their class labels are given below.  [5 Marks]

                                                                                     

 

 

Set            Docid   Words in document                 c = c1   c = c2
Training set   d1      positive                          Yes      No
Training set   d2      Very positive                     Yes      No
Training set   d3      Positive very positive            Yes      No
Training set   d4      very negative                     No       Yes
Training set   d5      negative                          No       Yes
Test set       d6      Negative positive very positive   ?        ?
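A possible way to work this out, assuming a multinomial Naïve Bayes model with add-one (Laplace) smoothing and case-insensitive tokens. The question does not state the variant, so both choices are assumptions, and the function names are illustrative.

```python
import math
from collections import Counter

train = [
    ("positive", "c1"),
    ("very positive", "c1"),
    ("positive very positive", "c1"),
    ("very negative", "c2"),
    ("negative", "c2"),
]
test_doc = "negative positive very positive"


def train_nb(docs):
    """Count documents per class, term occurrences per class, and the vocabulary."""
    class_docs = Counter()
    term_counts = {}
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        class_docs[label] += 1
        term_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def log_scores(text, class_docs, term_counts, vocab):
    """log P(c) + sum of log P(t | c), with add-one smoothing."""
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total = sum(term_counts[label].values())
        score = math.log(class_docs[label] / n_docs)
        for token in text.lower().split():
            score += math.log((term_counts[label][token] + 1) / (total + len(vocab)))
        scores[label] = score
    return scores


scores = log_scores(test_doc, *train_nb(train))
predicted = max(scores, key=scores.get)
print(predicted, {label: math.exp(s) for label, s in scores.items()})
```

Under these assumptions d6 is assigned to c1, with unnormalised scores of roughly 0.0069 for c1 against 0.0019 for c2.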

 

C) The search engine ranked results on 0-5 relevance scale: 2, 2, 3, 0, 5. Calculate the NDCG metric for the same. [5 Marks]
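A short sketch of the calculation, assuming the DCG form that sums rel_i / log2(i + 1) over ranks i = 1..n and normalises by the DCG of the ideal (descending) ordering. Other DCG variants exist, for example with gain 2^rel - 1, and would give a different value.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


ranking = [2, 2, 3, 0, 5]
print(round(dcg(ranking), 4), round(ndcg(ranking), 4))
```

With this variant the ranking has a DCG of about 6.696 against an ideal DCG of about 8.754, giving an NDCG of about 0.765.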

                                                                                                                      

 





---------------------------------------------------------------------------- 