Monday, January 9, 2023

BITS-WILP-SPA-Regular 2020-Mid Semester

Birla Institute of Technology & Science, Pilani
Work Integrated Learning Programmes Division

Second Semester 2019-2020
Mid-Semester Test
(EC-2 Regular)

Course No. : DSECLZG556
Course Title : Stream Processing & Analytics
Nature of Exam : Closed Book
Weightage : 30%
Duration : 2 Hours
Date of Exam :

Note to Students:
1. Please follow all the Instructions to Candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
3. Assumptions made if any, should be stated clearly at the beginning of your answer.
----------------------------------------------------------------------------------------------------------------

Q.1. What are streaming data systems? Explain the Generalized Streaming Data architecture and
its various components. [6]

Q.2. A system has been developed to provide vote-counting updates for parliamentary elections,
which interested parties can use to receive the latest counting updates. Each constituency is
divided into six blocks, and each block has several voting centers. Counting is done center-wise
and takes approximately 30 minutes per center. Once the counting for all the centers in a block
is done, the central system is notified of the latest state of votes received by the various
candidates. Giving three reasons, justify whether the system described above is a case of streaming
data or not. [6]

Q.3 Compare the different streaming data delivery protocols with respect to the following points:
I. Message frequency
II. Communication direction
III. Message Latency
IV. Efficiency [6]

Q.4 Consider an international airline that operates in both the passenger and cargo segments.
For every flight in the air, the airline captures a large amount of data in real time, which is used for
live tracking of flight status, modelling flight schedules, preventive maintenance scheduling, etc.
At the same time, the same data is used for various analytical purposes oriented towards improving
airline operations, predicting passenger and cargo loads in the near future, and devising marketing
strategies around them. Identify the appropriate data processing architecture that can help in
achieving these use cases. With the help of an architectural diagram, represent the proposed system
architecture. [6]
No. of Questions = 05


Q.5 A producer produces messages that are fed to a Kafka topic with three partitions.
Another producer produces messages that are fed both to the topic mentioned earlier and to a
different Kafka topic with two partitions. There are 6 Kafka brokers in the system and 3
consumers, of which the first two listen to the partitions of the first topic while the last one
listens to the partitions of the second topic. For each topic partition, 2 replicas are maintained
in the cluster. Draw a suitable Kafka cluster architectural diagram fulfilling the above-mentioned
requirements. [6]




---------------------------------------------------------------------------- 

Thursday, January 5, 2023

Hadoop 2 - Architecture







---------------------------------------------------------------------------- 

MapReduce Programming Architecture and Flow

1. The input dataset is split into multiple pieces of data (several small sets).
2. The framework creates a master and several worker processes and executes the worker processes remotely.
3. Several map tasks work simultaneously, each reading the piece of data assigned to it. A map worker uses the
map function to extract only the data present on its server and generates key/value pairs for the extracted
data.
4. The map worker uses a partitioner function to divide the data into regions. The partitioner decides which
reducer should get the output of a given mapper.
5. When the map workers complete their work, the master instructs the reduce workers to begin theirs.
6. The reduce workers in turn contact the map workers to get the key/value data for their partition (shuffle).
The data received from the various mappers is merge-sorted by key.
7. The reduce worker then calls the reduce function on every unique key. This function writes the output to a file.
8. When all the reduce workers complete their work, the master transfers control back to the user program.
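
A minimal sketch of this flow in plain Python (single process, not an actual Hadoop job; all function and variable names are invented for the illustration):

from collections import defaultdict

# Map phase: each "map worker" turns its piece of input into key/value pairs.
def map_phase(chunk):
    return [(word.lower(), 1) for word in chunk.split()]

# Shuffle phase: group values from all mappers by key (what the reducers fetch).
def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

# Reduce phase: the reduce function is called once per unique key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

# Input "split" into pieces, each processed by a separate map task.
chunks = ["big data big ideas", "data beats ideas"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(shuffle(mapped)))   # {'big': 2, 'data': 2, 'ideas': 2, 'beats': 1}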






---------------------------------------------------------------------------- 

Big Data Analytics Lifecycle






---------------------------------------------------------------------------- 

Mean Time To Failure and Other Formulas

Important formulas for Mean Time To Failure and related reliability metrics




MTTF - Mean Time To Failure
MTTF = 1 / failure rate = Total #hours of operation / Total #units
MTTF is an averaged value. In reality the failure rate changes over time because it may depend on the age of the component.

Failure rate = 1 / MTTF (assuming an average value over time)

MTTR - Mean Time To Recovery / Repair
MTTR = Total #hours of maintenance / Total #repairs

MTTD - Mean Time To Diagnose

MTBF - Mean Time Between Failures
MTBF = MTTD + MTTR + MTTF

MTTF - Serial assembly
MTTF of system = 1 / SUM(1/MTTFi) over all components i
Failure rate of system = SUM(1/MTTFi) over all components i

MTTF - Parallel assembly
MTTF of system = SUM(MTTFi) over all components i

Availability = Time the system is UP and accessible / Total time observed

Availability = MTTF / (MTTD + MTTR + MTTF)
or
Availability = MTTF / MTBF
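
A small Python sketch of these formulas (the component MTTF values and maintenance times below are invented for the example):

# Availability and MTBF from the formulas above (all times in hours).
def mtbf(mttd, mttr, mttf):
    return mttd + mttr + mttf

def availability(mttd, mttr, mttf):
    return mttf / mtbf(mttd, mttr, mttf)

# Serial assembly: failure rates add, so 1/MTTF_system = SUM(1/MTTF_i).
def mttf_serial(mttfs):
    return 1.0 / sum(1.0 / m for m in mttfs)

# Parallel assembly (as per the formula above): MTTF_system = SUM(MTTF_i).
def mttf_parallel(mttfs):
    return sum(mttfs)

components = [2000.0, 4000.0, 8000.0]           # hypothetical MTTFs in hours
print(mttf_serial(components))                  # ~1142.86 h
print(mttf_parallel(components))                # 14000.0 h
print(availability(mttd=1, mttr=4, mttf=995))   # 995 / 1000 = 0.995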

---------------------------------------------------------------------------- 

Wednesday, January 4, 2023

Cache Performance and Access Time of Memories

• Cache hit
✓ When the CPU refers to memory and finds the data or instruction in the cache memory
• Cache miss
✓ When the desired data or instruction is not found in the cache memory and the CPU refers to
main memory to find it

Hit + Miss = Total CPU references
Hit ratio h = Hit / (Hit + Miss)

The average access time of a memory system with two levels:
✓ Cache Memory
✓ Main Memory
• If Tc is the time to access the cache memory, Tm is the time to access main memory, and h is
the cache hit ratio, then
Tavg = average time to access memory
Tavg = h * Tc + (1 - h) * (Tm + Tc)
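
A quick numeric check of the formula, with assumed access times and reference counts:

# Average memory access time, assuming a miss still pays the cache lookup time.
def avg_access_time(h, tc, tm):
    return h * tc + (1 - h) * (tm + tc)

hits, misses = 950, 50                     # hypothetical reference counts
h = hits / (hits + misses)                 # hit ratio = 0.95
print(avg_access_time(h, tc=10, tm=100))   # 0.95*10 + 0.05*(100+10) = 15.0 ns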

---------------------------------------------------------------------------- 

Big Data Architecture Challenges








---------------------------------------------------------------------------- 

Apache Technology Ecosystem




---------------------------------------------------------------------------- 

Big Data architecture style





---------------------------------------------------------------------------- 

TF-IDF in Information Retrieval

The tf-idf (term frequency-inverse document frequency) is a measure of the importance of a word in a document or a collection of documents. It is commonly used in information retrieval and natural language processing tasks.

The formula for calculating tf-idf is:

tf-idf = tf * idf

where:

  • tf (term frequency) is the frequency of the word in the document. It can be calculated as the number of times the word appears in the document divided by the total number of words in the document.

  • idf (inverse document frequency) is a measure of the rarity of the word. It can be calculated as the logarithm of the total number of documents divided by the number of documents that contain the word.

The resulting tf-idf score for a word reflects both the importance of the word in the specific document and its rarity in the collection of documents. Words that are common across all documents will have a lower tf-idf score, while words that are specific to a particular document and rare in the collection will have a higher tf-idf score.
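
A minimal Python sketch of this calculation on a made-up three-document corpus (document texts and names invented for the example):

import math

# tf  = count of the term in the doc / total words in the doc
# idf = log(total number of documents / number of documents containing the term)
def tf_idf(term, doc, corpus):
    words = doc.lower().split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in corpus if term in d.lower().split())
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = ["stream processing with kafka",
          "batch processing with hadoop",
          "distributed processing systems"]
print(tf_idf("kafka", corpus[0], corpus))       # ~0.275: rare term -> higher score
print(tf_idf("processing", corpus[0], corpus))  # 0.0: appears in every doc -> idf is 0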





---------------------------------------------------------------------------- 

Jaccard Coefficient - Information Retrieval



Links : Example

----------------------------------------------------------------------------

Minimum Edit Distance - Information Retrieval






Useful links : Link1
                      Execution2Intention Example



---------------------------------------------------------------------------- 

K-Gram Index (Bigram Indexes) - Information Retrieval







K Gram example : Stanford

---------------------------------------------------------------------------- 

Tuesday, January 3, 2023

Permuterm index - Information Retrieval




Link : Stanford Link 

---------------------------------------------------------------------------- 

Difference between Word, Term, Token and Type

Word – A delimited string of characters as it appears in the text.

Term – A “normalized” word (case, morphology, spelling, etc.); an equivalence class of words.
e.g., “Walk” and “walk” are normalized to the same term.

Token – An instance of a word or term occurring in a document.
e.g., if the same word occurs several times in a document, each occurrence is a separate token,
so token counts reflect how many times the word occurs.

Type – The same as a term in most cases: an equivalence class of tokens.
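
A tiny illustration of the token vs. type distinction in Python (the sentence is chosen only for the example):

text = "to be or not to be"
tokens = text.split()      # every occurrence counts
types = set(tokens)        # equivalence classes (distinct terms)
print(len(tokens))         # 6 tokens
print(len(types))          # 4 types: {'to', 'be', 'or', 'not'}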

---------------------------------------------------------------------------- 

Issues with Information Retrieval

Information Retrieval deals with uncertainty and vagueness in information systems.

Uncertainty: the available representation typically does not reflect the true semantics/meaning of the objects (text, images, video, etc.).

Vagueness: the user's information need lacks clarity and is only vaguely expressed in the query, feedback, or user actions.

This differs conceptually from database queries!




---------------------------------------------------------------------------- 

Information Retrieval vs Data Retrieval -- Tabular form

 



---------------------------------------------------------------------------- 

Monday, January 2, 2023

Deep Learning - Mid Semester - Makeup - DSECLZG524

---------------------------------------------------------------------------- 

Sunday, January 1, 2023

Information Retrieval -- DSECLZG537 - Mid Semester Question Paper - June 2021

 

Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division

June 2021

Mid-Semester Test

(EC-1 Regular)

No. of Pages = 2
No. of Questions = 2

 


Course No. : SS ZG537

Course Title : INFORMATION RETRIEVAL

Nature of Exam : Closed Book

Weightage : 30%

 

Note:

1. Please follow all the Instructions to Candidates given on the cover page of the answer book.

2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.

3. Assumptions made if any, should be stated clearly at the beginning of your answer.

 

Q1 – 2+5+3+5=15 Marks

A) Give an example of uncertainty and vagueness issues in Information Retrieval. [2 Marks]

 

B) Explain the merge algorithm for the query “Information Retrieval”. What is the best order of query processing for the query “BITS AND Information AND Retrieval”? What documents will be returned as output from the 15 documents? [5 Marks]



 


Solution:

Merge Algorithm - Intersecting two posting lists : Algorithm


Output document - 11
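
A sketch of the intersection (merge) step in Python, walking the two sorted posting lists with two pointers. The posting lists below are hypothetical (the original figure is not reproduced here), chosen so that the intersection returns document 11 as in the stated answer:

# Intersect two sorted posting lists by advancing the pointer of the smaller docID.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Hypothetical posting lists for the two query terms.
information = [2, 5, 8, 11, 14]
retrieval = [1, 3, 11, 13]
print(intersect(information, retrieval))   # [11]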

 

C) [3 Marks]

 

D) Build an inverted index using Blocked Sort-Based Indexing for 50 million records. Explain the algorithm in detail with respect to indexing 50 million records. [5 Marks]

 

 

Q2 – 5+5+5=15 Marks

A) Assume a corpus of 10000 documents. The following tables give the TF and DF values for 3 terms in the corpus. Calculate the logarithmic TF-IDF values. [5 Marks]

 

Term     Doc1   Doc2   Doc3
bits     15     5      20
pilani   2      20     0
mtech    0      20     15

Term     dft
bits     2000
pilani   1500
mtech    500
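
A hedged sketch of the logarithmic TF-IDF computation for this table, assuming the common weighting (1 + log10(tf)) * log10(N/df) for tf > 0 and 0 for tf = 0; the exact scheme expected in the model answer may differ:

import math

N = 10000                       # documents in the corpus
tf = {"bits":   [15, 5, 20],    # term frequencies in Doc1..Doc3
      "pilani": [2, 20, 0],
      "mtech":  [0, 20, 15]}
df = {"bits": 2000, "pilani": 1500, "mtech": 500}

def log_tf_idf(term, count):
    if count == 0:
        return 0.0
    return (1 + math.log10(count)) * math.log10(N / df[term])

for term, counts in tf.items():
    print(term, [round(log_tf_idf(term, c), 3) for c in counts])
# bits   [1.521, 1.188, 1.608]
# pilani [1.072, 1.896, 0.0]
# mtech  [0.0, 2.994, 2.831]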

 

 

 

 

 

B) Classify the test document d6 into c1 or c2 using the Naive Bayes classifier. The documents in the training set and the corresponding class labels are given below. [5 Marks]

                                                                                     

 

 

              Docid   Words in document                 c = c1   c = c2
Training Set  d1      positive                          Yes      No
              d2      very positive                     Yes      No
              d3      positive very positive            Yes      No
              d4      very negative                     No       Yes
              d5      negative                          No       Yes
Test Set      d6      negative positive very positive   ?        ?
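
A hedged sketch of a multinomial Naive Bayes computation for this table, assuming add-one (Laplace) smoothing over the vocabulary; the variant expected in the model answer may differ:

from collections import Counter

train = [("positive", "c1"),
         ("very positive", "c1"),
         ("positive very positive", "c1"),
         ("very negative", "c2"),
         ("negative", "c2")]
test = "negative positive very positive"

classes = ["c1", "c2"]
vocab = {w for doc, _ in train for w in doc.lower().split()}

def score(doc, c):
    docs_c = [d.lower().split() for d, label in train if label == c]
    prior = len(docs_c) / len(train)              # P(c)
    counts = Counter(w for d in docs_c for w in d)
    total = sum(counts.values())
    p = prior
    for w in doc.lower().split():
        # add-one (Laplace) smoothing over the vocabulary
        p *= (counts[w] + 1) / (total + len(vocab))
    return p

for c in classes:
    print(c, score(test, c))
# c1 ~ 0.00686, c2 ~ 0.00185, so d6 is classified as c1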

 

C) The search engine ranked results on a 0-5 relevance scale: 2, 2, 3, 0, 5. Calculate the NDCG metric for these results. [5 Marks]
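
A sketch using one common NDCG convention, DCG = rel_1 + SUM(rel_i / log2 i) for i >= 2; the course may use a different gain or discount:

import math

def dcg(rels):
    # rel_1 + sum of rel_i / log2(i) for positions i = 2..n
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

ranked = [2, 2, 3, 0, 5]
ideal = sorted(ranked, reverse=True)    # [5, 3, 2, 2, 0]
ndcg = dcg(ranked) / dcg(ideal)
print(round(dcg(ranked), 3), round(dcg(ideal), 3), round(ndcg, 3))
# 8.046 10.262 0.784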

                                                                                                                      

 





---------------------------------------------------------------------------- 