Showing posts with label DSECLZG556. Show all posts
Showing posts with label DSECLZG556. Show all posts

Tuesday, April 11, 2023

Decaying Window

This algorithm allows you to identify the most popular elements (trending, in other words) in an incoming data stream.


The decaying window algorithm not only tracks the most recurring elements in an incoming data stream, but also discounts any random spikes or spam requests that might have boosted an element’s frequency. In a decaying window, you assign a score or weight to every element of the incoming data stream. Further, you need to calculate the aggregate sum for each distinct element by adding all the weights assigned to that element. The element with the highest total score is listed as trending or the most popular.


  1. Assign each element with a weight/score.
  2. Calculate aggregate sum for each distinct element by adding all the weights assigned to that element.


In a decaying window algorithm, you assign more weight to newer elements. For a new element, you first reduce the weight of all the existing elements by a constant factor k and then assign the new element with a specific weight. The aggregate sum of the decaying exponential weights can be calculated using the following formula:

t1i=0ati(1c)i

Here, c is usually a small constant of the order  or . Whenever a new element, say at+1 , arrives in the data stream you perform the following steps to achieve an updated sum:

1. Multiply the current sum/score by the value (1−c).
2. Add the weight corresponding to the new element.

Generally i is considered as 1 

Weight decays exponentially over time

In a data stream consisting of various elements, you maintain a separate sum for each distinct element. For every incoming element, you multiply the sum of all the existing elements by a value of (1−c). Further, you add the weight of the incoming element to its corresponding aggregate sum.

A threshold can be kept to, ignore elements of weight lesser than that.

Finally, the element with the highest aggregate score is listed as the most popular element.

Example

For example, consider a sequence of twitter tags below:
fifa, ipl, fifa, ipl, ipl, ipl, fifa

Also, let's say each element in sequence has weight of 1.
Let's c be 0.1
The aggregate sum of each tag in the end of above stream will be calculated as below:

fifa

fifa - 1 * (1-0.1) = 0.9
ipl - 0.9 * (1-0.1) + 0 = 0.81 (adding 0 because current tag is different than fifa)
fifa -  0.81 * (1-0.1) + 1 = 1.729 (adding 1 because current tag is fifa only)
ipl - 1.729 * (1-0.1) + 0 = 1.5561
ipl - 1.5561 * (1-0.1) + 0 = 1.4005
ipl - 1.4005 * (1-0.1) + 0 = 1.2605
fifa -  1.2605 * (1-0.1) + 1 = 2.135

ipl

fifa - 0 * (1-0.1) = 0
ipl - 0 * (1-0.1) + 1 = 1 
fifa -  1 * (1-0.1) + 0 = 0.9 (adding 0 because current tag is different than ipl)
ipl - 0.9 * (1-0.01) + 1 = 1.81
ipl - 1.81 * (1-0.01) + 1 = 2.7919
ipl - 2.7919 * (1-0.01) + 1 = 3.764
fifa - 3.764 * (1-0.01) + 0  = 3.7264

In the end of the sequence, we can see the score of fifa is 2.135 but ipl is 3.7264
So, ipl is more trending then fifa
Even though both of them occurred same number of times in input there score is still different.

Advantages of Decaying Window Algorithm:

  1. Sudden spikes or spam data is taken care.
  2. New element is given more weight by this mechanism, to achieve right trending output.

Similar Blog : Nitin Decaying Window

---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Sampling + Reservoir Sampling Algorithm + Hot List Problem + Frequent Algorithm +Space Saving Algorithm


Sampling : subset of data to be analyzed such that it has approximately the same properties of the original data

Random Sampling:Each element has equal probability of being selected
Reservior Sampling : A technique to maintain online random sample

Reservior Sampling Algorithm

1.Used to maintain an online random sample
2.Maintains a sample of size k, known as reservoir
3.Every new element has a probability k/n of replacing an old element in the reservoir
4.Let us say, k=1: An item in a stream of length n replaces the element in the reservoir with probability 1/n




Hot List Problem

Goal to Achieve

1.continuously maintain a list of the top-k most frequent elements in a stream
2.rank the items and absolute count is immaterial

Hot List Problem +  Frequent Algorithm 



Hot List Problem - Space Saving Algorithm




---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Saturday, February 18, 2023

Midsemester - Makeup - SPA - Question Paper with answers - Jan 2023

Birla Institute of Technology & Science, Pilani

Work Integrated Learning Programmes Division

First Semester 2022-2023

 

Mid-Semester Test

(EC-2 Makeup – ANSWER KEY)

 

Course No.                  : DSECL ZC556

Course Title                : Stream Processing and Analytics

Nature of Exam           : Open  Book

No. of Pages        =  4

No. of Questions =  5

Weightage                   : 30%

Duration                      : 2 Hours 

Date of Exam              : 06/03/2021 or 19/03/2021  (FN/AN)

Note to Students:

1.       Please follow all the Instructions to Candidates given on the cover page of the answer book.

2.       All parts of a question should be answered consecutively. Each answer should start from a fresh page. 

3.       Assumptions made if any, should be stated clearly at the beginning of your answer.

 

Q1. Consider an online food ordering and delivery platform which enables the customers to browse the nearby serving restaurants any time, explore their menu options, order the food and get it delivered at their doorsteps, also provide the ratings for the foods. The platform also enables the restaurant owners to analyze the click-stream as well as historical data related to their restaurant so that they can improvise their decisions with respect to the promotional offers made on the platform.                                                                                                           [1 +1 + 2 + 1 + 1 = 6]

 

a)      Identify the different data sources involved in this scenario. Also label them as internal or external.

b)      If you have been asked to design the customer data stream in this scenario, how it will look like?

c)      Give two examples of exploratory data analysis that restaurant owners can perform with this data? Give details.

d)      Recommend any suitable machine learning (with adequate justification) that can help the restaurant owners to improvise their decisions about marketing efforts.

e)      Which type of system architecture will be useful in this scenario?

 

Answer:

a)      Customer profile data, Restaurant data, Orders database, Food delivery agents data

All of them are of type internal to the system

b)      It will be mix of few customer + restaurant + order data

{“timestamp”, “cust_id”, “rest_id”, “order_id”, “items”, “payment_mode”…..}

c)      Some examples -

Find out factors affecting the ratings of the restaurant

Find out the relationship between the items those are ordered together

d)      Few examples –

Clustering – find out customers who are similar in terms of order, location etc

Regression – to predict the customer rating for this order or restaurant or to determine when he will place his next order or estimate the next order pricing

Classification – to predict whether customer will place an order in next n days or not or identify the customer to make an offer / discount

Recommendation – recommend the menu items to the customers

e)      Lambda – historical data for prediction, real time data for ordering and tracking etc.

 

 

Q2. Imagine that you are building a real time traffic routing system. Identify and justify the communication mechanisms used while making data available to the external world in the following use cases.                                                                                                                       [6]

 

a)      People driving around any city can use mobile application to get updates and be re-routed based on up-to-the moment traffic conditions. The user will post the interest for a particular route through mobile application and in turn the system will send the updates about the traffic conditions on that route.

b)      Police department wants to host web application that has a real time dashboard which continuously shows the traffic conditions at several points in the city and take appropriate action to resolve traffic congestions based on the data feeds those are coming from your streaming systems.

c)      A device is fitted into the commercial passenger vehicles which are roaming around the city. This device has the capability to show the current traffic conditions on the display and send the current coordinates of the vehicle and other relevant information back to the streaming system.

Answer:

a)      Webhooks

User registers the interest and application sends back the data to the mobile application

b)      Server sent events (SSE)

Connection between dashboard and system will remain open and data is made available to the client to update the dashboard

c)      WebSockets

Enables the two way communication between the device and streaming system

For each sub-question, Identification of each technique - 1 mark and adequate justification 1 mark

 

Q3. Consider a banking systems that has 4000 ATMs situated across many places in a particular region. The bank customer can carry out many transactions like balance check, money withdrawal and deposit, password reset. These transactions takes 4, 8 and 6 bytes respectively. Each of this transaction has a timestamp attached to it which takes 8 bytes. This transaction data needs to be stored at different places like at ATMs, at an immediate buffer at streaming system side and in a persistent storage. Narrate the three use-cases which require the data to be stored at these three places and estimate the sizing required for the same and suitable option for the storage.            [6]

 

Answer:

a)      At ATM –

May be just transactions data happened in ATM during last 30 minutes – if something goes wrong with any transaction like wrong password coming again and again for same customer, immediately raise an alarm to check that the person trying out the transaction is really card owner or not, local processing at ATM needs to be done

In memory databases can be used but sizing will vary from machine to machine and can be better estimated based on historical transaction data for that machine

 

b)      At buffer – immediate trend analysis wrt money withdrawal patterns and use it for feeding the ATMs with money most of the times

May be a day’s transactions data needs to be persisted in the buffer

Data flow tiers such as Kafka, Flume can be helpful

For storing processed data, caching systems can be useful

 

c)      At database – for historical data analysis , both (a) and (b) will get benefited by this sort of permanent storage like Databases or data stores

Sizing will depend upon business’s opinion about how recent data needs to be taken into consideration. 

For each correct identification – 1 mark and explanation 1 mark

 

Q4. Answer the following questions in brief:                                                                   [2 *3 = 6]

 

a) How Apache Flume blurs boundaries between data motion and processing?

Answer:

Interceptors are where Flume begins to blur line between data motion and processing.  Interceptor model allows Flume not only to modify the metadata headers of an event, but also allows the Event to be filtered according to those headers.

 

b) Mention two important ways by which processed streaming data can be made available to the end users?

Answer:

1) Through dashboards - by sending processed data to the visualizations placed on dashboards

2) Though alerts/notifications - by sending important/exceptions updates to the users through alerts or other notification channels

 

c)                  “Apache Kafka adopts a prescriptive order for reading and writing operations for a topic”. Justify this statement mentioning whether it is true or false. 

Answer:

True. Apache Kafka is not a queuing system like ActiveMQ. It does not follow the semantics that messages get processed as they are arrived. Kafka’s partitioning system does not maintain such structure.

 

Q5. Consider the following block diagram of data flow system based on Apache Flume. [1 +4 + 1 =6]

a)      Identify essential components from the perspective of Flume Agent.

b)      Provide the suitable configuration details for Apache Flume Agent that matches this data flow scenario. 

c)      Which type of data flow is represented with this block diagram?

Answer:

a)       

Three essential components of Flume Agent

·         Sources – Avro source, Thrift source and Syslog source for port monitoring data

·         Channels – Memory channel

·         Sink – HDFS sink, ElasticSearch sink

Identification of sources, channels and sinks – 1 mark

b)

Agent configuration should have mention of three sources, channel and sinks

myAgent.sources = myAvroSource, myThriftSource, mySyslogSource

myAgent.channels = myMemoryChannel

myAgent.sinks = myHDFSSink, myESSink

 

myAgent.source. myAvroSource.type = avro

myAgent.source. myThriftSource.type = thrift

myAgent.source. mySyslogSource.type = syslogtcp

 

myAgent.sink. myHDFSSink.type = hdfs

myAgent.sink. myESSink.type = elasticsearch

 

myAgent.channels. myMemoryChannel.type =memory

 

myagent.sources.myAvroSource.channel = myMemoryChannel

myagent.sources. myThriftSource.channel = myMemoryChannel

myagent.sources. mySyslogSource.channel = myMemoryChannel

myagent.sinks. myHDFSSink.channel = myMemoryChannel

myagent. sinks. myESSink.channel = myMemoryChannel

 

Definition of source, sink and channel – 0.5 mark

Configuration of source – 1 mark

Configuration of channel – 0.5 mark

Configuration of sink – 1 mark

Mapping between source/sink and channel – 1 mark

 

 

b)      Fan-in flow – from multiple sources to single channel

 

********************



---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Comprehensive- Regular - SPA - Question Paper with answers - Mar2021

 

Birla Institute of Technology & Science, Pilani

Work Integrated Learning Programmes Division

Second Semester 2021-2022

 

Comprehensive Test

(EC-3 Regular)

 

Course No.                  : DSECLZC556

Course Title                : Stream Processing and Analytics

Nature of Exam           : Open Book

No. of Pages        =  4

No. of Questions =  4

Weightage                   : 40%

Duration                      : 2 Hours 

Date of Exam              : 06/03/2021 or 19/03/2021  (FN/AN)

Note to Students:

1.       Please follow all the Instructions to Candidates given on the cover page of the answer book.

2.       All parts of a question should be answered consecutively. Each answer should start from a fresh page. 

3.       Assumptions made if any, should be stated clearly at the beginning of your answer.

 

Q1. When you decide to implement your own Bloom filter, you need to understand the main formulas relating important parameters impacting the design of Bloom filter, so that you can optimally configure the Bloom filter. Consider the following notation for the four parameters of the Bloom filter:

·         f = the false positive rate

·         m = number of bits in a Bloom filter

·         n = number of elements to insert

·         k = number of hash functions

The formula that determines the false positive rate as a function of other three parameters is as follows (Formula 1):

a)      For each of the following pair of bits-per-element value and number of hash functions, compute the value of “f”. Show all the necessary calculations. [10]

·         Bits-per-element: 5,6,8,10

·         Number of hash functions: 1 to 10

 

k

m/n

1

2

3

4

5

6

7

8

9

10

5

F?

F?

F?

F?

F?

F?

F?

F?

F?

F?

6

F?

F?

F?

F?

F?

F?

F?

F?

F?

F?

8

F?

F?

F?

F?

F?

F?

F?

F?

F?

F?

10

F?

F?

F?

F?

F?

F?

F?

F?

F?

F?

b)            Plot the graph of “f” against Bits-per-element and number of hash functions. [2]

c)            What is impact of change in Bits-per-element on the false positive rate? [1]

d)            Is there any relevant relationship that exhibit between the number of hash functions and false positive rate? [1]

e)            If the optimal k for a particular bits-per-element is given by following formula, then for Bits-per-element value of 7, what is optimal number of hash functions are required? [1]

Q2. The weighted moving average (WMA) is generalization of the standard moving average that uses different weights for each of the elements in the window. This collection of weights is known as “kernels”. Consider the following implementation of WAM algorithm.

 

Public class WMA {

            Double [] kernel;

            Double [] values;

            Double kernelSum = 0;

            Int k = 0;

            Long N = 0;

 

            Public WMA (double [] kernel) {

                        this.kernel = kernel;

                        for ( double j : kernel) kernelSum += j;

                        values = new double[kernel.length];

            }

            Public double observe (double x) {

                        Values [k++] = x;

                        If ( k == values.length ) k = 0;

                        N++;

                        If ( N < kernel.length) return Double.NaN;

                        Double y = 0;

                        For ( int i = 0; i < kernel.length; i++)

                                    Y += kernel[i] * values [ (k+i) % values.length];

                        return y/ kernelSum;

            }

}

Assume the kernel weights are given as 1, 2, 3, and 1.

 

a)      Compute the WMA for each of the following “x” values. Show all the necessary calculations. [7]

x = 1, 2, 3, 4, 5, 6, 7, 8

b)      Discuss the impact of this algorithm in off-line and online processing environments. [1]

 

Q3. Consider the following stream of events coming from a truck. Periodically these events are received and processed on the server side for doing some rum time analytics as shown in the query below.

 

Event Stream data

 

ID

Event

Processing Time

Status

Qty

Time

T1

11.2

12

Moving

2

T2

11.15

12

Moving

3

T3

11.09

12

Moving

1

T1

11.5

12

Moving

2

T2

11.45

12

Static

3

T3

11.39

12

Broken

1

T4

11.19

12

Moving

2

T1

12.2

1

Moving

2

T2

12.15

1

Static

3

T3

12.09

1

Broken

1

T4

11.49

1

Moving

2

T1

12.5

1

Broken

2

T2

12.45

1

Moving

3

T3

12.39

1

Static

1

T4

12.37

1

Moving

2

 

 

The structured streaming Query –

inputDataFrame.groupBy(Status).window(120 minutes).count(Qty)

 

Showcase the content of following tables when the query is executed at 12.00 and 1.00 PM respectively.                                                                                                                                 [7]

a)      Input table

b)      Result table

c)      Output when mode is respectively

                                i.            Complete

                              ii.            Update

                            iii.            Append

 

 

Q.4. Look at following stream of data values with time stamp.             [1 + 2 + 1 + 1 + 3 + 2 = 10]

 

Value

34

67

-6

78

34

12

90

45

12

time

0

0

1

1

2

2

2

3

3

 

(For a, b, c) If a count based tumbling window is defined with eviction policy set to 4,

a)      How many windows will be processed for above stream of data values?

b)      What will be the difference between the first and last data value in the second window?

c)      If the trigger policy is set to 2, then how many times the code will be executed for query?

(For d, e) If a time based tumbling window is defined,

d)      With eviction policy set to 3 seconds, how many windows will be processed for the above stream of data values?

e)      If the trigger policy and eviction policy both are set to 1 second, what will be the average values (for each window)?

f)       If a sliding window is defined, with slide interval 1 second and window length 1 seconds, what are the different windows that will be visible for the given streaming data values?

 

***********

 



---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.