Wednesday, February 8, 2023

Midsemester - Regular - SPA - Question Paper with answers - Jan 2023

 

Birla Institute of Technology & Science, Pilani

Work Integrated Learning Programmes Division

First Semester 2022-2023

 

Mid-Semester Test

(EC-2 Regular – ANSWER KEY)

 

Course No.                  : DSECL ZC556

Course Title                : Stream Processing and Analytics

Nature of Exam           : Open  Book

No. of Pages        =  4

No. of Questions =  5

Weightage                   : 30%

Duration                      : 2 Hours 

Date of Exam              : 06/03/2021 or 19/03/2021  (FN/AN)

Note to Students:

1.       Please follow all the Instructions to Candidates given on the cover page of the answer book.

2.       All parts of a question should be answered consecutively. Each answer should start from a fresh page. 

3.       Assumptions made if any, should be stated clearly at the beginning of your answer.

 

Q1. Consider an Apache ZooKeeper ensemble set up with different numbers of servers. Briefly discuss the consequences (if any) of each of the following ensemble configurations. [1 + 1 + 1 = 3]

a)      Single server

b)      Two servers

c)      Odd number of servers (more than one)

Answer:

a)      Single server – single point of failure

b)      Two servers – data is replicated on both, but a majority quorum of two needs both servers to agree, so the failure of either one halts the ensemble; fault tolerance is no better than with a single server

c)      Odd number of servers – better fault tolerance, and a majority quorum is easy to obtain with an odd count (an ensemble of 2n + 1 servers tolerates n failures), but increasing the number of servers has an adverse impact on performance
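
The quorum arithmetic behind these answers can be sketched as follows, assuming the usual majority rule (more than half of the ensemble must be reachable). The function names are illustrative and not part of any ZooKeeper API.

```python
def quorum_size(servers: int) -> int:
    """Smallest majority of an ensemble with the given number of servers."""
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """Servers that may fail while a majority is still reachable."""
    return servers - quorum_size(servers)


for n in (1, 2, 3, 4, 5):
    print(f"{n} server(s): quorum={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Under this rule an even-sized ensemble tolerates no more failures than the odd-sized ensemble one server smaller, which is why odd sizes are usually preferred.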

Q2. Imagine that you are building a next-generation telecom company, one that helps users avoid wasting data from their existing data package subscription. The first step is to deploy a smart agent on each customer’s phone. These agents will report the phone’s application data use every hour, along with information on which applications are consuming the most data. Based on this, we would like to offer customers variable pricing in real time, based on when and how they run their favorite applications. [1 + 2 + 1 + 1 + 1 = 6]

a)      Show a sample of the tuple generated out of the agent on phone.

b)      Give two examples of exploratory analysis you will do on this data? Give details.

c)      What can be the target variable in this example?

d)      What sort of data modelling technique can be utilized for it?

e)      Which type of system architecture will be useful in this scenario?

Answer:

a)      {Timestamp, phone_no, {app1:data_usage, app2:data_usage, app3:data_usage,…}}

b)      Each application’s hourly / daily / day-wise (weekday vs. weekend) data usage

Day wise data requirements for a device

Categorizing the data usages by types of applications – entertainment, work related

c)      Recommended plan or subscription value

d)      Classification for plan recommendation; regression for plan value

e)      Lambda – both current and historical data are taken into consideration for plan recommendation
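
As a rough illustration of answers (a) and (b), the sketch below represents the hourly reports as Python dictionaries and totals usage per application. The field names (`timestamp`, `phone_no`, `usage_mb`), the app names and all values are made up for the example.

```python
from collections import defaultdict

# Hypothetical hourly reports from the on-phone agent (illustrative values only).
reports = [
    {"timestamp": "2023-01-21T10:00:00", "phone_no": "P1",
     "usage_mb": {"video": 300.0, "chat": 20.0}},
    {"timestamp": "2023-01-21T11:00:00", "phone_no": "P1",
     "usage_mb": {"video": 150.0, "mail": 5.0}},
]


def usage_per_app(records):
    """Total data usage per application across all reports."""
    totals = defaultdict(float)
    for record in records:
        for app, megabytes in record["usage_mb"].items():
            totals[app] += megabytes
    return dict(totals)


totals = usage_per_app(reports)
top_app = max(totals, key=totals.get)   # application consuming the most data
print(totals, top_app)
```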

 

Q.3 A streaming data application is built to process the events received from different data sources. The processed data needs to be made available to various clients for further use and should remain available for six months after it is processed. Storing the data should not have any impact on the analysis layer. Data persistence can be handled in a batch manner. Propose a suitable storage solution that meets all of the above requirements for data storage. [6]

Answer:



Identifying the long term storage mechanism – 1 mark

Decoupling between Analysis layer and Batch layer – 2 marks

Writing the messages between the batch layer and long term storage – 2 marks

Drawing the overall architecture diagram – 1 mark

 

 

 

Q4. Answer the following questions in brief:                                                                   [2 *3 = 6]

 

a)      Why is a messaging or data flow layer required in streaming data systems?

Answer:

To provide the required delivery semantics, such as at-least-once, in streaming data systems

To buffer incoming data so that the processing layer does not build up a huge processing backlog

b)      List four common data processing operations that can be performed on the streaming data.

Answer:

·         Cleaning

·         Filter

·         Join

·         Grouping

 

c)      List two points that motivate the need for systems like Apache ZooKeeper in distributed environments.

Answer:

Managing the metadata of the cluster

Maintaining the distributed state of the cluster

 

Q5. Consider the following Kafka Cluster description.                                      [3 + 1 + 1 + 1 = 6]

·         4 node cluster

·         Name of the Topic: Bus

·         Number of Partitions: 3

·         Replication factor of ‘Bus’: 2

·         3 producers

·         2 consumers

a)      Draw Kafka’s architecture clearly highlighting the following in a block diagram

-producers, consumers, broker, topic and partitions.

Answer:



Producers = 0.5

Consumers = 0.5

Broker = 1

Topic, partition, replica = 1

 

b)      How many consumer groups can be created for this configuration?

Answer:

There can be an indefinite number of consumer groups working simultaneously.

 

c)      What is the maximum number of consumers that each consumer group can have while ensuring maximum parallelism?

Answer:

3 – Each reading from a different partition
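
A simplified sketch of why three is the useful maximum: within one consumer group each partition is read by exactly one consumer, so any consumer beyond the partition count stays idle. The round-robin function below is an illustration only; Kafka's real assignors are more elaborate.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment of partitions to the consumers of one group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = ["Bus-0", "Bus-1", "Bus-2"]
for group_size in (2, 3, 4):
    members = [f"c{i}" for i in range(group_size)]
    assignment = assign_partitions(partitions, members)
    idle = [c for c, owned in assignment.items() if not owned]
    print(group_size, assignment, "idle:", idle)
```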

 

d)      What is the maximum number of server failures that this setup can handle?

Answer:

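3 in the best case, which requires the one surviving broker to hold a replica of every partition; with a replication factor of 2, the number of broker failures that can always be tolerated without losing a partition is the replication factor minus one, i.e. 1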
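
How many broker failures are survivable depends on where the replicas sit, which the question does not specify. The sketch below assumes one hypothetical placement of the six replicas (3 partitions, replication factor 2) on brokers 0 to 3 and checks by brute force how many simultaneous failures are always survivable, meaning every partition keeps at least one replica on a live broker.

```python
from itertools import combinations

# Hypothetical replica placement; a real cluster may place replicas differently.
placement = {"Bus-0": {0, 1}, "Bus-1": {1, 2}, "Bus-2": {2, 3}}
brokers = {0, 1, 2, 3}


def survives(failed):
    """True if every partition still has a replica on a broker that is up."""
    return all(replicas - set(failed) for replicas in placement.values())


def guaranteed_tolerance():
    """Largest k such that every possible set of k failed brokers is survivable."""
    k = 0
    while k < len(brokers) and all(
        survives(failed) for failed in combinations(brokers, k + 1)
    ):
        k += 1
    return k


print(guaranteed_tolerance())
```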

 

Q6.  Consider the use case of a real-time traffic and routing service for vehicles on the road. The traffic and navigation service is expected to handle requests from a large number of cars. [1+1+1=3]

a)         What will be the appropriate interaction pattern in the collection tier?

b)         Suggest message delivery semantics for this problem.

c)         List out storage requirements with respect to analytics tier.

 

Answer:

a)         The suitable interaction pattern in this case is request-response: the navigation service can handle requests about traffic and alternate routes from many cars simultaneously, and in this use case the ability to request data asynchronously and process it as it arrives is critical.

b)         Message delivery semantics: at-most-once, because:

If communication between the collection tier and the analysis tier is interrupted for a long time, service to the drivers is delayed, but there is no risk to life when the system fails, so occasional message loss can be tolerated. Also, only a small chunk of data needs to be stored each day, since only locality-specific data is under consideration.

c)         Storage solution: data can be stored and refreshed at periodic intervals, so message queues suffice instead of long-term storage for further analysis. Analysis of congested areas, locations and timestamps can be used to develop predictive models for later use, for which data from the recent past is sufficient.

 

********************




---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Monday, January 30, 2023

BITS-WILP-DSAD-Regular-2023-Mid Semester









---------------------------------------------------------------------------- 

BITS-WILP-MFDS-Regular 2023 - Jan 2023- Mid semester





---------------------------------------------------------------------------- 

Tuesday, January 24, 2023

Midsemester - Information Retrieval -- DSECLZG537 - Jan 7th 2023

Information Retrieval - Regular - Mid Semester conducted on 7th Jan 2023









---------------------------------------------------------------------------- 

Monday, January 23, 2023

Regular - Mid Semester - Deep Learning - DSECLZG524 - 7th Jan 2023








---------------------------------------------------------------------------- 

Sunday, January 22, 2023

Midsemester - Regular - SPA - Question Paper

BITS - Mid Semester - SPA - Regular - 21st Jan 2023







---------------------------------------------------------------------------- 

Tuesday, January 17, 2023

Data Delivery Semantic - SPA

Data Delivery Semantic

Three options for data delivery and processing:
• At most once delivery
• At least once delivery
• Exactly once delivery

At most once delivery semantic

• Systems used for monitoring purposes
• Important to inform the admins about the problems
• Not every data transmission is required
• Down-sample the data to improve performance
• Data loss is approximately known

Exactly once delivery semantic

• Financial systems or advertising systems
• Every message has to be delivered only once
• Data loss not affordable as it might be revenue loss
• Achieved through queuing systems like ActiveMQ, RabbitMQ
• Usually queue semantics implemented on server side

At least once delivery semantic

• Balances the two extremes by providing reliable message delivery while pushing the message-handling
semantics to the consumer
• Consumers are free to implement message handling without worrying about other consumers
• Dependent on application logic and handled at the application level only
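
The difference between the semantics can be sketched with a toy sender, assuming a lost message (or lost acknowledgement) is the only failure. The function names are illustrative; real brokers implement these guarantees very differently. The last function shows the consumer-side handling mentioned above: with at-least-once delivery, the application removes duplicates itself.

```python
def at_most_once(messages, lost):
    """Send each message once and never retry: lost messages stay lost."""
    return [m for m in messages if m not in lost]


def at_least_once(messages, ack_lost):
    """Resend until acknowledged: a lost ack leads to a duplicate delivery."""
    delivered = []
    for m in messages:
        delivered.append(m)
        if m in ack_lost:        # sender saw no ack, so it sends again
            delivered.append(m)
    return delivered


def deduplicate(delivered):
    """Consumer-side idempotence: handle each message id only once."""
    seen, result = set(), []
    for m in delivered:
        if m not in seen:
            seen.add(m)
            result.append(m)
    return result


messages = ["m1", "m2", "m3"]
print(at_most_once(messages, lost={"m2"}))
print(at_least_once(messages, ack_lost={"m2"}))
print(deduplicate(at_least_once(messages, ack_lost={"m2"})))
```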


---------------------------------------------------------------------------- 

Kappa Architecture - SPA


Kappa Architecture

• The idea is to handle both real-time data processing and continuous reprocessing in a single
stream processing engine.
• This requires that the incoming data stream can be replayed (very quickly), either in its entirety
or from a specific position.
• If there are any code changes, then a second stream process would replay all previous data
through the latest real-time engine and replace the data stored in the serving layer.
• This architecture attempts to simplify matters by keeping only one code base rather than managing one
for each of the batch and speed layers of the Lambda Architecture.
• In addition, queries only need to look in a single serving location instead of going against batch
and speed views.
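
A minimal sketch of the replay idea, assuming the event log fits in a Python list and the "serving layer" is a single dictionary. The job functions and the data are invented for illustration.

```python
# Replayable, append-only log of (sensor_id, temperature) events (illustrative data).
event_log = [("s1", 12), ("s1", 14), ("s1", 15)]


def job_v1(events):
    """Original streaming job: running sum of temperatures."""
    total = 0
    for _, temp in events:
        total += temp
    return {"sum": total}


def job_v2(events):
    """Changed code: additionally tracks the maximum temperature."""
    total, highest = 0, float("-inf")
    for _, temp in events:
        total += temp
        highest = max(highest, temp)
    return {"sum": total, "max": highest}


serving_view = job_v1(event_log)   # view built by the first version of the job
serving_view = job_v2(event_log)   # code changed: replay the log, replace the view
print(serving_view)
```

Only one code path exists at any time; the old view is simply replaced once the replay through the new code has finished.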






Pros
• Kappa architecture can be used to develop data systems that are online learners and
therefore don’t need the batch layer.
• Re-processing is required only when the code changes.
• It can be deployed with fixed memory.
• It can be used for horizontally scalable systems.
• Fewer resources are required as the machine learning is done in real time.
Cons
• The absence of a batch layer might result in errors during data processing or while updating the
database, which requires an exception manager to reprocess or reconcile the data.




---------------------------------------------------------------------------- 

Lambda Architecture -- SPA

Basic Flow of Events of Lambda Architecture

1. All data entering the system is dispatched to both the batch layer and the speed layer for
processing.
2. The batch layer has two functions:
(i) managing the master dataset (an immutable, append-only set of raw data)
(ii) to pre-compute the batch views.
3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc
way.
4. The speed layer compensates for the high latency of updates to the serving layer and deals
with recent data only.
5. Any incoming query can be answered by merging results from batch views and real-time views
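
Step 5 can be sketched as follows, assuming the views are simple event counts held in memory. The data and names are illustrative only.

```python
from collections import Counter

master_dataset = ["a", "b", "a", "c"]    # immutable, append-only raw events
recent_events = ["a", "c"]               # arrived after the last batch run

batch_view = Counter(master_dataset)     # precomputed by the batch layer
realtime_view = Counter(recent_events)   # maintained by the speed layer


def query(key):
    """Answer a query by merging the batch view and the real-time view."""
    return batch_view[key] + realtime_view[key]


print(query("a"), query("b"), query("c"))
```

After the next batch run the recent events become part of the batch view and the real-time view for that period is discarded.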

Batch Layer

• New data comes continuously, as a feed to the data system.
• It gets fed to the batch layer and the speed layer simultaneously.
• It looks at all the data at once and eventually corrects the data in the stream layer.
• Here we can find lots of ETL and a traditional data warehouse.
• This layer is built using a predefined schedule, usually once or twice a day.
• The batch layer has two very important functions:
• To manage the master dataset
• To pre-compute the batch views.




Speed Layer (Stream Layer)

• This layer handles the data that are not already delivered in the batch view due to the latency of
the batch layer.
• In addition, it only deals with recent data in order to provide a complete view of the data to the
user by creating real-time views.
• The speed layer produces its outputs through an enrichment process and supports the serving
layer in reducing the latency of responding to queries.
• As is obvious from its name, the speed layer has low latency because it deals with real-time
data only and has less computational load.

Serving Layer

• The outputs from batch layer in the form of batch views and from speed layer in the form of
near-real time views are forwarded to the serving layer.
• This layer indexes the batch views so that they can be queried in low-latency on an ad-hoc
basis.

Pros
• Batch layer of Lambda architecture manages historical data with the fault tolerant
distributed storage which ensures low possibility of errors even if the system crashes.
• It is a good balance of speed and reliability.
• Fault tolerant and scalable architecture for data processing.
Cons
• It can result in coding overhead due to involvement of comprehensive processing.
• Re-processes every batch cycle which is not beneficial in certain scenarios.
• Data modelled with the Lambda architecture is difficult to migrate or reorganize.




---------------------------------------------------------------------------- 

Generalized Architecture - SPA

Streaming Data System Architecture Components
• Collection
• Data Flow
• Processing
• Storage
• Delivery




Collection System 

• Mostly communication over TCP/IP network using HTTP
• Website log data was the use case in the early days
• W3C standard log data format was used
• Newer formats like JSON, AVRO, Thrift are available now
• Collection happens at specialized servers called edge servers
• Collection process is usually application specific
• Newer servers integrate directly with data flow systems
• Older servers may or may not integrate directly with data flow systems

Data Flow Tier

• Separation between collection tier and processing layer is required
• The rates at which these systems work are different
• What if one system is not able to cope with another?
• An intermediate layer is required that takes responsibility for
• accepting messages / events from collection layer
• providing those messages / events to processing layer
• Real time interface to data layer for both producers and consumers of data
• Helps in guaranteeing the “at least once” semantics
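
The decoupling role of the data flow tier can be sketched with a bounded in-memory queue between a fast collector and a slower processor. This uses Python's standard `queue` and `threading` modules as a stand-in for a real messaging system; the buffer size and sleep time are arbitrary.

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=5)    # the data flow tier: a bounded buffer
processed = []


def collector(count):
    """Collection tier: produces events as fast as the buffer allows."""
    for i in range(count):
        buffer.put(f"event-{i}")   # blocks while the buffer is full
    buffer.put(None)               # sentinel: no more events


def processor():
    """Processing tier: consumes events at its own, slower pace."""
    while True:
        event = buffer.get()
        if event is None:
            break
        time.sleep(0.001)          # simulate slower processing
        processed.append(event)


producer = threading.Thread(target=collector, args=(20,))
consumer = threading.Thread(target=processor)
producer.start()
consumer.start()
producer.join()
consumer.join()
print(len(processed))
```

The collector never needs to know how fast the processor is; when the buffer fills up it simply waits, and no event is dropped.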

Processing / Analytics Tier

• Based on “data locality” principle
• Move the software / code to the location of the data
• Rely on distributed processing of data
• The framework does most of the heavy lifting: data partitioning, job scheduling, job management
• Available frameworks
• Apache Storm
• Apache Spark (Streaming)
• Apache Kafka Streams, etc.

Storage Tier

• In memory or permanent
• Usually in memory as data is processed once
• But there can be use cases where events / outcomes need to be persisted as well
• NoSQL databases becoming popular choice for permanent storage
• MongoDB
• Cassandra
• But usage varies as per the use case, still no database that fits all use cases

Delivery Layer  

• Usually web based interface
    • Nowadays mobile interfaces are becoming quite popular
    • Dashboards are built with streaming visualizations that get continuously updated as
underlying events are processed
    • HTML + CSS + JavaScript + WebSockets can be used to create interfaces and update
them
    • HTML5 elements can be used to render interfaces
    • SVG, PDF formats used to render the outcomes
• Monitoring / Alerting Use cases
• Feeding data to downstream applications






---------------------------------------------------------------------------- 

Monday, January 9, 2023

BITS-WILP-BDS-Regular 2023-Mid Semester

===================================================================

Name : MTDSE CLUSTER-MID SEM-1st Sem 2022-2023 EC-2R Jan23
Subject : DSECLZG522 
               BIG DATA SYSTEMS EC 2R
===================================================================

Q1. Discuss briefly 3 key issues that will impact the performance of a data parallel application and need careful optimization. 

Q2. The CPU of a movie streaming server has L1 cache reference of 0.5 ns and main memory reference of 100 ns. The L1 cache hit during peak hours was found to be 23% of the total memory references.         [Marks: 4]

  1. Calculate the cache hit ratio h.
  2. Find out the average time (Tavg) to access the memory.
  3. If the size of the cache memory is doubled, what will be the impact on h and Tavg.
  4. If there is a total failure of the cache memory, calculate h and Tavg.

Q3. A travel review site stores (user, hotel, review) tuples in a data store. E.g. tuple is (“user1”, “hotel ABC”, “<review>”). The data analysis team wants to know which user has written the most reviews and the hotel that has been reviewed the most. Write MapReduce pseudo-code to answer this question.         [Marks: 4]

Q4. An e-commerce site stores (user, product, rating) tuples for data analysis. E.g. tuple is (“user1”, “product_x”, 3), where rating is from 1-10 with 10 being the best. A user can rate many products and products can be rated by many users. Write MapReduce pseudo-code to find the range (min and max) of ratings received for each product. So each output record contains (<product>, <min rating> to <max rating>).              [Marks: 4]

Q5. Name a system and explain how it utilises the concepts of data and tree parallelism.           [Marks: 3]

Q6. An enterprise application consists of a 2 node active-active application server cluster connected to a 2 node active-passive database (DB) cluster. Both tiers need to be working for the system to be available. Over a long period of time it has been observed that an application server node fails every 100 days and a DB server node fails every 50 days. A passive DB node takes 12 hours to take over from the failed active node. Answer the following questions.            [Marks: 4]

  1. What is the overall MTTF of the 2-tier system ?
  2. Assume only a single failure at any time, either in the App tier or in the DB tier, and an equal probability of an App or a DB node failure. What is your estimate of the availability of the 2-tier system ?

Q7. In the following application scenarios, point out what is most important - consistency or availability, when a system failure results in a network partition in the backend distributed DB. Explain briefly the reason behind your answer.          [Marks: 4]

(a) A limited quantity discount offer on a product for 100 items at an online retail store is almost 98% claimed.
(b) An online survey application records inputs from millions of users across the globe.
(c) A travel reservation website is trying to sell rooms at a destination that is seeing very few bookings.
(d) A multi-player game with virtual avatars and users from all across the world needs a set of sequential steps between team members to progress across game milestones.

Q8. Assume that you have a NoSQL database with 3 nodes and a configurable replication factor (RF). R is the number of replicas that participate to return a Read request. W is the number of replicas that need to be updated to acknowledge a Write request. In each of the cases below explain why data is consistent or in-consistent for read requests.               [Marks: 4]

1. RF=1, R=1, W=1.
2. RF=2, R=1, W=Majority/Quorum.
3. RF=3, R=2, W=Majority/Quorum.
4. RF=3, R=Majority/Quorum, W=3.
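
One common way to reason about these cases is the quorum overlap rule: a read is guaranteed to touch at least one replica holding the latest acknowledged write when R + W > RF. The sketch below applies that rule to the four cases, assuming a majority means RF // 2 + 1 replicas and ignoring failures and concurrent writes; the function names are illustrative.

```python
def majority(rf: int) -> int:
    """Quorum size for a given replication factor."""
    return rf // 2 + 1


def reads_see_latest_write(rf: int, r: int, w: int) -> bool:
    """Overlap rule: read and write replica sets intersect when R + W > RF."""
    return r + w > rf


cases = {
    "case 1": (1, 1, 1),
    "case 2": (2, 1, majority(2)),
    "case 3": (3, 2, majority(3)),
    "case 4": (3, majority(3), 3),
}
for label, (rf, r, w) in cases.items():
    print(label, f"RF={rf} R={r} W={w}", reads_see_latest_write(rf, r, w))
```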





---------------------------------------------------------------------------- 

BITS-WILP-SPA-Makeup 2021- Final Semester

Birla Institute of Technology & Science, Pilani

Work Integrated Learning Programmes Division

Second Semester 2020-2021


Comprehensive Examination

(EC-3 Make-up)


Course No. :  DSECL ZG556

Course Title :  STREAM PROCESSING AND ANALYTICS

Nature of Exam :  Open Book 

Weightage :  45% 

Duration :  2 Hours 

Date of Exam :  11-09-2021  FN

Note to Students: 

  1. Please follow all the Instructions to Candidates given on the cover page of the answer book.

  2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.  

  3. Assumptions made if any, should be stated clearly at the beginning of your answer. 

 


Q1. Every day a multinational online taxi dispatch company gathers terabytes of event data from its mobile users. By using Kafka, Spark Streaming, and HDFS, to build a continuous ETL pipeline, they can convert raw unstructured event data into structured data as it is collected, and then use it for further and more complex analytics.                                                                                                                            [5 + 5 = 10]

  1. With this scenario in mind, explain how Spark Streaming will be leveraged as solution using a nicely labelled architecture diagram? 

  2. List and briefly explain the Apache Spark APIs that can be used in? 

Q.2. Consider the following Kafka Cluster description.                                      

  • 10 node cluster

  • Name of the Topic: Cluster

  • Number of Partitions: 4

  • The Replication factor of ‘Bus’: 3

  • 7 producers

  • 5 consumers

  1. Draw Kafka’s architecture clearly highlighting the following in a block diagram

-producers, consumers, broker, topic and partitions. 

  2. How many consumer groups can be created for this configuration?

  3. What is the maximum number of consumers that each consumer group can have while ensuring maximum parallelism?

  4. What is the maximum number of server failures that this setup can handle?

[2 + 1 + 1 + 1 = 5]


Q.3. Explain the various components available in Apache Flink with a suitable real-time example.

                                                                                                                                                         [10]


Q4. Consider an online ecommerce portal where customers can search for products anonymously, but need an account with the provider to place an order. While customers browse the products on the portal, their online behavior is monitored by the provider. The provider has a business relationship with another online movie service provider, whose movies are also displayed and sold on the provider’s platform. The users’ search queries are shared between these providers and are also matched with the users’ profiles to provide product / movie recommendations. For this purpose the portal uses Apache Storm as its streaming platform. With the help of a suitable architectural diagram, represent how this recommendation activity can be carried out. [8]



Q5. Consider the following streaming SQL query where an output record (or row) is generated specifying the updates to the minimum and maximum temperatures over the window W1, plus an incrementally updated average for the temperature over that period.                                                                                    [3 * 4 = 12]


   SELECT STREAM    

         MIN(TEMP) OVER W1 AS WMIN_TEMP,

         MAX(TEMP) OVER W1 AS WMAX_TEMP,

         AVG(TEMP) OVER W1 AS WAVG_TEMP

    FROM WEATHERSTREAM

    WINDOW W1


Let’s assume that the input weather stream has the following temperature values arriving at a regular interval of two minutes:

{12, 14, 15, 13, 16, 20}


What will be the output of the above query (with proper explanation) if

  1. Window is defined as sliding window of length 3 

  2. Window is defined as batch window of length 3 

  3. Window is defined as sliding window of time 4 minutes

  4. Window is defined as batch window of time 3 minutes
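
The two count-based cases can be explored with the sketch below. It assumes that a sliding window emits one result per new event once the window is full, and that a batch window emits one result per non-overlapping group of events; streaming SQL engines differ on these details, and the time-based cases additionally depend on how window boundaries align with event timestamps, so they are not covered here.

```python
temps = [12, 14, 15, 13, 16, 20]


def sliding_windows(values, length):
    """Overlapping windows that advance by one event at a time."""
    return [values[i:i + length] for i in range(len(values) - length + 1)]


def batch_windows(values, length):
    """Non-overlapping windows, each holding `length` consecutive events."""
    return [values[i:i + length]
            for i in range(0, len(values) - length + 1, length)]


def summarize(window):
    """(min, max, average) of one window, as in the query's three columns."""
    return min(window), max(window), sum(window) / len(window)


for window in sliding_windows(temps, 3):
    print("sliding", window, summarize(window))
for window in batch_windows(temps, 3):
    print("batch", window, summarize(window))
```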




---------------------------------------------------------------------------- 