forwards: WILP

Showing posts with label WILP. Show all posts

Tuesday, April 11, 2023

Decaying Window

This algorithm allows you to identify the most popular elements (trending, in other words) in an incoming data stream.

The decaying window algorithm not only tracks the most recurring elements in an incoming data stream, but also discounts any random spikes or spam requests that might have boosted an element’s frequency. In a decaying window, you assign a score or weight to every element of the incoming data stream. Further, you need to calculate the aggregate sum for each distinct element by adding all the weights assigned to that element. The element with the highest total score is listed as trending or the most popular.

Assign each element with a weight/score.
Calculate aggregate sum for each distinct element by adding all the weights assigned to that element.

In a decaying window algorithm, you assign more weight to newer elements. For a new element, you first reduce the weight of all the existing elements by a constant factor k and then assign the new element with a specific weight. The aggregate sum of the decaying exponential weights can be calculated using the following formula:

∑t−1i=0at−i(1−c)i

Here, c is usually a small constant of the order

10^{- 6}

10^{- 9}

. Whenever a new element, say at+1 , arrives in the data stream you perform the following steps to achieve an updated sum:

1. Multiply the current sum/score by the value (1−c).

2. Add the weight corresponding to the new element.

Generally i is considered as 1

Weight decays exponentially over time

In a data stream consisting of various elements, you maintain a separate sum for each distinct element. For every incoming element, you multiply the sum of all the existing elements by a value of (1−c). Further, you add the weight of the incoming element to its corresponding aggregate sum.

A threshold can be kept to, ignore elements of weight lesser than that.

Finally, the element with the highest aggregate score is listed as the most popular element.

Example

For example, consider a sequence of twitter tags below:
fifa, ipl, fifa, ipl, ipl, ipl, fifa

Also, let's say each element in sequence has weight of 1.
Let's c be 0.1
The aggregate sum of each tag in the end of above stream will be calculated as below:

fifa

fifa - 1 * (1-0.1) = 0.9

ipl - 0.9 * (1-0.1) + 0 = 0.81 (adding 0 because current tag is different than fifa)

fifa - 0.81 * (1-0.1) + 1 = 1.729 (adding 1 because current tag is fifa only)

ipl - 1.729 * (1-0.1) + 0 = 1.5561

ipl - 1.5561 * (1-0.1) + 0 = 1.4005

ipl - 1.4005 * (1-0.1) + 0 = 1.2605
fifa - 1.2605 * (1-0.1) + 1 = 2.135

ipl

fifa - 0 * (1-0.1) = 0

ipl - 0 * (1-0.1) + 1 = 1

fifa - 1 * (1-0.1) + 0 = 0.9 (adding 0 because current tag is different than ipl)

ipl - 0.9 * (1-0.01) + 1 = 1.81

ipl - 1.81 * (1-0.01) + 1 = 2.7919

ipl - 2.7919 * (1-0.01) + 1 = 3.764
fifa - 3.764 * (1-0.01) + 0 = 3.7264

In the end of the sequence, we can see the score of fifa is 2.135 but ipl is 3.7264
So, ipl is more trending then fifa
Even though both of them occurred same number of times in input there score is still different.

Advantages of Decaying Window Algorithm:

Sudden spikes or spam data is taken care.
New element is given more weight by this mechanism, to achieve right trending output.

Similar Blog : Nitin Decaying Window

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Sampling + Reservoir Sampling Algorithm + Hot List Problem + Frequent Algorithm +Space Saving Algorithm

Sampling : subset of data to be analyzed such that it has approximately the same properties of the original data

Random Sampling:Each element has equal probability of being selected

Reservior Sampling : A technique to maintain online random sample

Reservior Sampling Algorithm

1.Used to maintain an online random sample

2.Maintains a sample of size k, known as reservoir

3.Every new element has a probability k/n of replacing an old element in the reservoir

4.Let us say, k=1: An item in a stream of length n replaces the element in the reservoir with probability 1/n

Hot List Problem

Goal to Achieve

1.continuously maintain a list of the top-k most frequent elements in a stream

2.rank the items and absolute count is immaterial

Hot List Problem + Frequent Algorithm

Hot List Problem - Space Saving Algorithm

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Saturday, April 8, 2023

Cassandra

Cassandra is a NoSQL database for write-heavy workload and eventual consistency

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post.

Host/author is not responsible for these posts.

Wednesday, April 5, 2023

Link Analysis

Link analysis for most IR functionality thus far based purely on text

Scoring and ranking

Link-based clustering – topical structure from links

Links as features in classification – documents that link to one another are likely to be on the same subject

Assumptions

1.Reputed Sites

2.Annotation of Target

Page Rank

Scoring measure based on the link structure of web pages

PageRank = long-term visit rate = steady state probability.

Markov chains

Markov chain is a discrete-time stochastic process: a process that occurs in a series of time-steps in each of which a random choice is made.

A Markov chain consists of N states, plus an N × N transition probability matrix P.

state = page

At each step, we are on exactly one of the pages.

Teleporting – to get us out of dead ends

Example : https://nlp.stanford.edu/IR-book/html/htmledition/the-pagerank-computation-1.html

Calculator : https://www.statskingdom.com/markov-chain-calculator.html

HITS – Hyperlink-Induced Topic Search

Hubs: A hub page is a good list of links to pages answering the information need

Authorities: An authority page is a direct answer to the information need

A good hub page for a topic links to many authority pages for that topic

A good authority page for a topic is linked to by many hub pages for that topic

Calculator : http://www.cs.sjsu.edu/faculty/pollett/masters/Semesters/Spring10/amith/HITS.html

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Tuesday, April 4, 2023

Recommender Systems

Long tail effect : Many popular products are found in both Retail / Online

less popular products are found only online

---------------------------------
Modelling Recommender Systems

U = {USERS}

I = {ITEMS}

F is a utility function, measures the usefulness of items I to user U.

F:UxI -> R where R is the rating

--------------------------------

Characteristics of recommendation system

Types of recommender system

Jaccord similarity

http://emailforwardsindia.blogspot.com/search/label/Jaccard

Cosine Similarity

SVD - Singular value decomposition

https://atozmath.com/MatrixEv.aspx?q=svd

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Sunday, April 2, 2023

Web Crawler

Web Crawler : It is a software for downloading pages from the Web. Also known as Web Spider, Web Robot, or simply Bot.

Web Crawler Applications

1.Create an index covering broad topics (General Web search )

2.Create an index covering specific topics (Vertical Web search )

3.Archive content :(Web archival, URL: http://www.archive.org/ )

4.Analyze Web sites for extracting aggregate statistics (Web characterization )

5.Keep copies or replicate Web sites (Web mirroring-daily or weekly)

6.Web site analysis (broken links, site not available)

Crawler Taxonomy

Basic Web Crawler Architecture

3 components in Web Crawler

Scheduler - Maintains a queue of URLS to Visit

downloader - downloads the pages

Storage - makes indexing of pages and provides scheduler with metadata on the pages retrieved

Crawling complications

1.Malicious Pages - spam pages & spider traps (crawler traps)

2.Non-malicious pages -

Latency / Bandwidth to remote servers vary

Webmasters stipulations -- how deep one has to crawl in a website.

3.Site mirrors and duplicate pages

4.Politness --> how frequently we should hit the server

Explicit politeness: specifications from webmasters on what portions of site can be crawled

robots.txt

Implicit politeness: even with no specification, avoid hitting any site too often

---------------------------------------------------------------------------------------------------------------------

Crawler should be distributed / scalable / performance & efficiency

fetch higher quality pages first

continuous operation -- fetch fresh copies of previous pages

Extensible : Adapt to new data formats.

robots.txt --> avoids overloading of the site.

----------------------------------------------------------------------------------------------------------------------

URL frontier

1.The URL frontier is the data structure that holds and manages URLs we’ve seen, but that have not been crawled yet.

2.Can include multiple pages from the same host

3.Must avoid trying to fetch them all at the same time

4.Must keep all crawling threads busy

Considerations

Politeness: do not hit a web server too frequently

Freshness: crawl some pages more often than others

-------------------------------------------------------------------------------------------------------------

Basic crawl architecture

-------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Friday, March 31, 2023

Websearch

Search engine

1. Crawler Based

2. Directory

3. Metasearch

Search Types

1. General Search / Horizontal Search : ex : google --> results are very broad and results might not be relevant sometimes.

2. Vertical search - very specific search or specific part of internet ; ex: google images / Amazon product search.

Web challenges for IR

---------------------

1.Distributed Data: Documents spread over millions of different web servers.

2.Volatile Data: Many documents change or disappear rapidly (e.g. dead links).

3.Large Volume: Billions of separate documents.

4.Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% (near) duplicate documents.

5.Quality of Data: No editorial control, false information, poor quality writing, typos, etc.

6.Heterogeneous Data: Multiple media types (images, video, VRML), languages, character sets, etc.

Modeling the Web

----------------

Heaps’ and Zipf’s laws are also valid in the Web.

»In particular, the vocabulary grows faster (larger) and the word

distribution should be more biased (larger)

Heaps’ Law

» An empirical rule which describes the vocabulary growth as a function of the text size.

» It establishes that a text of n words has a vocabulary of size O(n𝛽) for

0< 𝛽 <1

Zipf’s Law

» An empirical rule that describes the frequency of the text words.

» It states that the i-th most frequent word appears as many times as

the most frequent one divided by i 𝛽, for some 𝛽 >1

Different types of queries

1. Informational queries : learn about something - 40%

2. Navigational queries : take to a page - 25%

3. Transactional queries : want to do something - 35%

Essential Characteristics for user-friendliness of a website

1.Mobile Compatibility

2.Accessible to All Users

3.Well Planned Information Architecture

4.Well-Formatted Content That Is Easy to Scan

5.Fast Load Times

6.Browser Consistency

7.Effective Navigation

8.Good Error Handling

9.Contrasting Color Scheme

10.Usable forms

Centralized Architecture - Crawler-Indexer Architecture

important components

1.Crawler / spider

2.Indexer

3.Query Engine

Indexing process

1. text acquisition

2. text transformation

3. Index creation

Query Process

1.User Interaction

2.Ranking

3.Evaluations

Distributed Architecture -

Harvest

Gathers and Brokers

User Interface

query interface

Answer interface

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Ranked Retrieval Evaluation - Rank Based Measures - Non- Binary relevance

Non- Binary relevance

Documents are rarely entirely relevant or non-relevant to a query

Many sources of graded relevance judgments

Relevance judgments on a 5-point scale

Multiple judges

Click distribution and deviation from expected levels

(but click-through != relevance judgments)

Normalized Discounted Cumulative Gain (NDCG)

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post.

Host/author is not responsible for these posts.

Ranked Retrieval Evaluation - Rank Based Measures - Binary relevance

Binary relevance --> we are only worried if it is relevant or not

Precision@K (P@K) -- only top documents ie compute % relevant documents are picked in K

Practical Example : Practical Example

-----------------------------------------------------------------

Mean Average Precision (MAP):

-----------------------------------------------------------------------

Mean Reciprocal Rank (MRR)

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post.Host/author is not responsible for these posts.

Wednesday, March 29, 2023

Unranked Retrieval Evaluation- Precision and Recall based on documents and F-Measure

Unranked Retrival Information

important measures are

Precision:The ability to retrieve top-ranked documents that are mostly relevant.

Recall:The ability of the search to find all of the relevant items in the corpus.

F-measure :

1.Takes into account both recall and precision.

2. Harmonic mean of recall and precision

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post.

Host/author is not responsible for these posts.

TREC Benchmark / Measuring Relevance / Evaluating IR system

Measures for Search Engine

1. Latency of search

2. Expressiveness of query Language

Ability to express complex information needs

Speed on complex queries

Measuring Releavance based on User

Three elements:

1. A benchmark document collection

2. A benchmark suite of queries

3. An assessment of either Relevant or Nonrelevant for each query and each document

TREC Benchmark - Text REtrieval Conference

Evaluating IR system -- need to verify if retrived document is relaveant or not

Difficulties in Evaluating IR Systems

1.Effectiveness is related to the relevancy of retrieved items.

2.Relevancy is not typically binary but continuous.

3.Even if relevancy is binary, it can be a difficult judgment to make.

Relevancy, from a human standpoint, is:

Subjective: Depends upon a specific user’s judgment.

Situational: Relates to user’s current needs.

Cognitive: Depends on human perception and behavior.

Dynamic: Changes over time.

Evaluating IR systems

1. Gold Standard (Human Labeled Corpora) : Using Humans to create Gold standard - Manual

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Tuesday, March 28, 2023

Anu's NLP Session on BITS WILP Pre Mid Semester Topic - 2 - DSECLZG525

Below vedio has the complete numerical explanation for NLP topics

1.Grammars and Parsing

2.A Top Down Parser

3. A Bottom- Up Chart Parser

4.Parsing Techniques

5.Context Free Grammars

5. Probabilistic CKY Parsing of PCFGs

6.Problems with PCFGs.

Session 1 video : Session 1

Anu Garg linkedin : https://www.linkedin.com/in/anu-garg-9ab13962

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post.Host/author is not responsible for these posts.

Anu's NLP Session on BITS WILP Pre Mid Semester Topic - 1 - DSECLZG525

Below Video has complete numerical explanation for NLP topics like

1.Natural Language Understanding and Generation

2.N-gram Language Modelling

3.Part-of-Speech Tagging

4.Hidden Markov Model Algorithms

5.The Forward Algorithm

6.The Viterbi Algorithm

7.The Forward-Backward Algorithm

8.Maximum Entropy Markov Model

9. Laplace smoothing.

Part 2 Session Link : Session 2

Anu Garg's Linked in : https://www.linkedin.com/in/anu-garg-9ab13962/

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post.

Host/author is not responsible for these posts.

Saturday, February 18, 2023

Makeup - Mid Semester- Questions and answers - Deep Learning - DSECLZG524 - Jan 2023

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

forwards

Tuesday, April 11, 2023

Decaying Window

Example

fifa

ipl

Advantages of Decaying Window Algorithm:

Sampling + Reservoir Sampling Algorithm + Hot List Problem + Frequent Algorithm +Space Saving Algorithm

Saturday, April 8, 2023

Cassandra

Wednesday, April 5, 2023

Link Analysis

Tuesday, April 4, 2023

Recommender Systems

Sunday, April 2, 2023

Web Crawler

Friday, March 31, 2023

Websearch

Ranked Retrieval Evaluation - Rank Based Measures - Non- Binary relevance

Ranked Retrieval Evaluation - Rank Based Measures - Binary relevance

Wednesday, March 29, 2023

Unranked Retrieval Evaluation- Precision and Recall based on documents and F-Measure

TREC Benchmark / Measuring Relevance / Evaluating IR system

Tuesday, March 28, 2023

Anu's NLP Session on BITS WILP Pre Mid Semester Topic - 2 - DSECLZG525

Anu's NLP Session on BITS WILP Pre Mid Semester Topic - 1 - DSECLZG525

Saturday, February 18, 2023

Makeup - Mid Semester- Questions and answers - Deep Learning - DSECLZG524 - Jan 2023

Regular - Mid Semester - Deep Learning- questions and answers - DSECLZG524 - 7th Jan 2023

Recent Visitor