Birla Institute of
Technology & Science, Pilani
Work Integrated
Learning Programmes Division
First Semester 2022-2023
Mid-Semester Test
(EC-2 Regular – ANSWER KEY)
Course No. :
DSECL ZC556
Course Title :
Stream Processing and Analytics
Nature of Exam :
Open Book
No. of Pages = 4 No. of Questions = 5
|
Duration :
2 Hours
Date of Exam : 06/03/2021 or 19/03/2021 (FN/AN)
Note to Students:
1. Please follow all the Instructions
to Candidates given on the cover page of the answer book.
2. All parts of a question
should be answered consecutively. Each answer should start from a fresh
page.
3. Assumptions made if any,
should be stated clearly at the beginning of your answer.
Q1 Consider an Apache Zookeeper
with different number of servers is set up. Discuss the consequences (if any)
briefly, for each of the following server configuration in ensemble. [1 + 1 + 1 = 3]
a)
Single server
b)
Two servers
c)
Odd number of servers (more than one)
Answer:
a) Single server – single point of failure
b) Two servers – fault tolerance provided against failures but getting
consensus while maintaining the state will be difficult as both of them has to
agree for the same
c) Odd number of servers – more fault tolerance and getting consensus
will be easy as quorum will be easily obtained with odd number, but increasing
the number of servers will have adverse impact on the performance
Q2. Imagine that you are building
a next-generation telecom company, one that help users to avoid wastage of data
from their existing data package subscription. First step is to deploy smart
agent at customers’ phone. These agents will report the application data use of
the phone every hour, along with information on which applications are
consuming the more data. Based on this, we would like to offer customers
variable pricing in real time, based on when and how they run their favorite
applications.
[1 + 2 + 1 + 1 + 1 = 6]
a)
Show a sample of the tuple generated out
of the agent on phone.
b)
Give two examples of exploratory analysis
you will do on this data? Give details.
c)
What can be the target variable in this
example?
d)
What sort of data modelling technique can
be utilized for it?
e)
Which type of system architecture will be
useful in this scenario?
Answer:
a) {Timestamp, phone_no, {app1:data_usage, app2:data_usage,
app3:data_usage,…}}
b)
Each applications hourly / daily
/ day wise (week-day/week-end) data usage
Day wise data
requirements for a device
Categorizing
the data usages by types of applications – entertainment, work related
c) Recommended plan or subscription value
d) Classification
– for plan recommendation , Regression – for plan value
e) Lambda –
Both current and historical data taken into consideration for plan
recommendation
Q.3 A streaming data application
is built to process the events received from different data sources. This
processed data needs to be made available to various clients for further usage.
This data should be available for six months after it’s processed. The process
of storing the data should not cause any impact for the analysis layer. The
data persistence can be handled in the batch manner. Propose a suitable storage
solution that will help to meet the all above mentioned requirements for data
storage.
[6]
Answer:
Identifying
the long term storage mechanism – 1 mark
Decoupling
between Analysis layer and Batch layer – 2 marks
Writing
the messages between the batch layer and long term storage – 2 marks
Drawing
the overall architecture diagram – 1 mark
Q4. Answer the following
questions in brief: [2
*3 = 6]
a) Why
messaging or data flow layer is required in streaming data systems?
Answer:
To provide the required delivery semantics like at-least-once in the
streaming data systems
To prevent the processing layer from huge processing backlog
b) List four common data processing operations
that can be performed on the streaming data.
Answer:
·
Cleaning
·
Filter
·
Join
·
Grouping
c) List
down two points which motivates the need of systems like Apache Zookeeper in
distributed environments?
Answer:
Managing the metadata of the cluster
Maintaining the distributed state of the cluster
Q5. Consider the following Kafka
Cluster description. [3 + 1 + 1 + 1 = 6]
·
4 node cluster
·
Name of the Topic: Bus
·
Number of Partitions: 3
·
The Replication factor of ‘Bus: 2
·
3 producers
·
2 consumers
a) Draw
Kafka’s architecture clearly highlighting the following in a block diagram
-producers,
consumers, broker, topic and partitions.
Answer:
Producers
= 0.5
Consumers
= 0.5
Borker
= 1
Topic
, partition , replica = 1
b) How
many consumer groups can be created for this configuration?
Answer:
There can be an indefinite number of
consumer groups working simultaneously.
c)
What is the maximum number of consumers that each
consumer group can have while ensuring maximum parallelism?
Answer:
3 – Each reading from a different partition
d)
What is the maximum number of server failures that this
setup can handle?
Answer:
3
Q6. Consider the use
case of real time traffic and routing service for the vehicles on the road. The
traffic and navigation service is expected to handle requests from lot of cars.
[1+1+1=3]
a) What will be
the appropriate interaction pattern in the collection tier?
b) Suggest
message delivery semantics for this problem.
c) List out
storage requirements with respect to analytics tier.
Answer:
a) The
suitable interaction pattern in this case would be -request response pattern
because the navigation service can handle the requests pertaining to the the
traffic and alternate routes for the cars simultaneously because in this
usecase ability to asynchronously request data and process the data as it
arrives will be critical.
b) Message
delivery semantics is atmost once because:
In case communication between the
collection tier and analysis tier is interrupted for a longer time, there could
be delay of service to the vehicle drivers and there is no life risk when
system fails (low tolerance). Also it might require small chunk of data storage
daily basis. The locality specific data is only under consideration
c) Storage
Solution: data can be stored and refreshed in periodic intervals of time. Hence
message queues would suffice instead of long term storage for further analysis.
Analysis on congested areas, locations and time stamps can be used to develop
predictive models for later use for which recent past data would be sufficient.
********************