Thursday, January 5, 2023

Hadoop 2 - Architecture







---------------------------------------------------------------------------- 

MapReduce Programming Architecture and flow

1. The input dataset is split into multiple pieces of data (several small sets).
2. The framework creates one master process and several worker processes, and executes the worker processes remotely.
3. Several map tasks work simultaneously, each reading the piece of data assigned to it. A map worker uses the map function to extract the data present on its server and generates key/value pairs from the extracted data.
4. The map worker uses a partitioner function to divide its output into regions. The partitioner decides which reducer should get the output of a given mapper.
5. When the map workers complete their work, the master instructs the reduce workers to begin their work.
6. The reduce workers in turn contact the map workers to get the key/value data for their partition (the shuffle). The data received from the various mappers is merge-sorted by key.
7. The reduce worker then calls the reduce function on every unique key and writes its output to a file (the word-count sketch after this list walks through these steps).
8. When all the reduce workers complete their work, the master transfers control back to the user program.
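
To make steps 3-7 concrete, here is a minimal, single-process Python sketch of the map -> partition -> shuffle/sort -> reduce flow using the classic word-count example. The names map_func, partition, reduce_func, and run_job are illustrative and not part of any Hadoop API; a real job would distribute these steps across worker processes.

from collections import defaultdict

def map_func(_, line):
    # Step 3: emit a key/value pair for every word in the assigned split
    for word in line.split():
        yield word.lower(), 1

def partition(key, num_reducers):
    # Step 4: the partitioner decides which reducer receives this key
    return hash(key) % num_reducers

def reduce_func(key, values):
    # Step 7: called once per unique key with all of its values
    yield key, sum(values)

def run_job(lines, num_reducers=2):
    # Steps 5-6: "shuffle" the mapper output into per-reducer partitions, grouped by key
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for i, line in enumerate(lines):
        for key, value in map_func(i, line):
            partitions[partition(key, num_reducers)][key].append(value)

    # Steps 6-7: sort the keys within each partition and apply the reduce function
    results = {}
    for part in partitions:
        for key in sorted(part):
            for out_key, out_value in reduce_func(key, part[key]):
                results[out_key] = out_value
    return results

print(run_job(["the quick brown fox", "the lazy dog", "the quick dog"]))
# word counts, e.g. 'the': 3, 'quick': 2, 'dog': 2, 'brown': 1, 'fox': 1, 'lazy': 1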






---------------------------------------------------------------------------- 

Big Data Analytics Lifecycle






---------------------------------------------------------------------------- 

Mean Time To Failure and Other Formulas

Important formulas for Mean Time To Failure and related reliability metrics




MTTF - Mean Time To Failure
 MTTF = 1 / failure rate = Total #hours of operation / Total #units
 MTTF is an averaged value; in reality the failure rate changes over time because it may depend on the age of the component.

 Failure rate = 1 / MTTF (assuming an average value over time)

 MTTR - Mean Time To Recovery / Repair
 MTTR = Total #hours of maintenance / Total #repairs

 MTTD - Mean Time To Diagnose

 MTBF - Mean Time Between Failures
 MTBF = MTTD + MTTR + MTTF


MTTF - Serial assembly

 MTTF of system = 1 / SUM(1/MTTFi) over all components i
 Failure rate of system = SUM(1/MTTFi) over all components i

MTTF - Parallel assembly

 MTTF of system = SUM(MTTFi) over all components i

 Availability = Time the system is up and accessible / Total time observed

 Availability = MTTF / (MTTD + MTTR + MTTF)
 or
 Availability = MTTF / MTBF
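
As a quick sanity check on the formulas above, here is a small Python sketch that applies the serial-assembly, parallel-assembly, and availability formulas exactly as stated; the component MTTF, MTTR, and MTTD values are made-up illustrative numbers.

def serial_mttf(mttfs):
    # Serial assembly: MTTF of system = 1 / SUM(1/MTTFi)
    return 1.0 / sum(1.0 / m for m in mttfs)

def parallel_mttf(mttfs):
    # Parallel assembly (simple additive model above): MTTF of system = SUM(MTTFi)
    return sum(mttfs)

def availability(mttf, mttr, mttd=0.0):
    # Availability = MTTF / (MTTD + MTTR + MTTF) = MTTF / MTBF
    return mttf / (mttd + mttr + mttf)

components = [10_000.0, 20_000.0, 40_000.0]   # hypothetical per-component MTTFs, in hours
print(serial_mttf(components))                # ~5714.3 hours
print(parallel_mttf(components))              # 70000.0 hours
print(availability(serial_mttf(components), mttr=8.0, mttd=2.0))   # ~0.9983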

---------------------------------------------------------------------------- 

Wednesday, January 4, 2023

Cache performance and Access time of memories

• Cache hit
✓ When the CPU refers to memory and finds the required data or instruction in the cache memory.
• Cache miss
✓ When the desired data or instruction is not found in the cache memory and the CPU has to refer to main memory to find it.
Hits + Misses = Total CPU references
Hit ratio h = Hits / (Hits + Misses)

The average access time is computed for a memory system that consists of two levels:
✓ Cache memory
✓ Main memory
• If Tc is the time to access the cache memory, Tm is the time to access main memory, and h is the cache hit ratio, then
Tavg = average time to access memory
Tavg = h * Tc + (1 - h) * (Tm + Tc)
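
A short worked example, with illustrative numbers that are not from the post: if Tc = 10 ns, Tm = 100 ns, and h = 0.95, then Tavg = 0.95*10 + 0.05*(100 + 10) = 15 ns. The same calculation in Python:

def avg_access_time(h, tc, tm):
    # Tavg = h*Tc + (1-h)*(Tm + Tc): the cache is probed first, and on a miss
    # main memory is accessed in addition to the cache lookup
    return h * tc + (1 - h) * (tm + tc)

print(avg_access_time(h=0.95, tc=10, tm=100))   # 15.0 (ns)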

---------------------------------------------------------------------------- 

Big Data Architecture Challenges








---------------------------------------------------------------------------- 

Apache Technology Ecosystem




---------------------------------------------------------------------------- 

Big Data architecture style





---------------------------------------------------------------------------- 

TF-IDF in Information Retrieval

The tf-idf (term frequency-inverse document frequency) is a measure of the importance of a word in a document or a collection of documents. It is commonly used in information retrieval and natural language processing tasks.

The formula for calculating tf-idf is:

tf-idf = tf * idf

where:

  • tf (term frequency) is the frequency of the word in the document. It can be calculated as the number of times the word appears in the document divided by the total number of words in the document.

  • idf (inverse document frequency) is a measure of the rarity of the word. It can be calculated as the logarithm of the total number of documents divided by the number of documents that contain the word.

The resulting tf-idf score for a word reflects both the importance of the word in the specific document and its rarity in the collection of documents. Words that are common across all documents will have a lower tf-idf score, while words that are specific to a particular document and rare in the collection will have a higher tf-idf score.
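
As a small illustration of these two definitions, here is a dependency-free Python sketch that computes tf-idf for a toy corpus; the documents are made up, and real libraries (for example scikit-learn's TfidfVectorizer) use slightly different smoothing and normalization variants.

import math

def tf(term, doc_tokens):
    # term frequency: occurrences of the term / total number of words in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # inverse document frequency: log(total documents / documents containing the term)
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

corpus = ["the cat sat on the mat",
          "the dog chased the cat",
          "the dogs and the cats are pets"]
tokens = [doc.split() for doc in corpus]

print(tf_idf("cat", tokens[0], tokens))   # > 0: "cat" is informative for this document
print(tf_idf("the", tokens[0], tokens))   # 0.0: "the" appears in every document, so idf = 0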





---------------------------------------------------------------------------- 

Jaccard Coefficient - Information Retrieval



Links : Example

----------------------------------------------------------------------------

Minimum Edit Distance - Information Retrieval






Useful links: Link1
Execution2Intention Example



---------------------------------------------------------------------------- 

K-Gram Index (Bigram Indexes) - Information Retrieval







K Gram example : Stanford

---------------------------------------------------------------------------- 

Tuesday, January 3, 2023

Permuterm index - Information Retrieval




Link : Stanford Link 

---------------------------------------------------------------------------- 

Difference between Word/term/token and type

Word – A delimited string of characters as it appears in the text.

Term – A "normalized" word (case, morphology, spelling, etc.); an equivalence class of words.
ex: the same word may appear many times in a text, but all of its variants map to a single term.

Token – An instance of a word or term occurring in a document.
ex: tokens are what we count when we need to know how many times a word occurs (see the short sketch after these definitions).

Type – The same as a term in most cases: an equivalence class of tokens.
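
A tiny Python sketch of the token/type distinction (the sample sentence and the lower-casing normalization are illustrative choices):

sentence = "To be or not to be"
tokens = sentence.lower().split()   # every occurrence counts: 6 tokens
types = set(tokens)                 # equivalence classes of tokens: 4 types

print(len(tokens), tokens)          # 6 ['to', 'be', 'or', 'not', 'to', 'be']
print(len(types), sorted(types))    # 4 ['be', 'not', 'or', 'to']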

---------------------------------------------------------------------------- 

Issues with Information Retrieval

Information Retrieval deals with uncertainty and vagueness in information systems.

Uncertainty: the available representation typically does not reflect the true semantics/meaning of the objects (text, images, video, etc.).

Vagueness: the user's information need lacks clarity and is only vaguely expressed in the query, feedback, or user actions.

This differs conceptually from database queries!




---------------------------------------------------------------------------- 

Information Retrieval vs Data Retrieval -- Tabular form

 



---------------------------------------------------------------------------- 

Monday, January 2, 2023

Deep Learning - Mid Semester - Makeup - DSECLZG524

---------------------------------------------------------------------------- 