Showing posts with label Retrieval. Show all posts
Showing posts with label Retrieval. Show all posts

Wednesday, April 5, 2023

Link Analysis

Link analysis for most IR functionality thus far based purely on text

Scoring and ranking
Link-based clustering – topical structure from links
Links as features in classification – documents that link to one another are likely to be on the same subject

Assumptions
1.Reputed Sites
2.Annotation of Target


Page Rank 
 Scoring measure based on the link structure of web pages
 PageRank = long-term visit rate = steady state probability.
 
Markov chains
 Markov chain is a discrete-time stochastic process: a process that occurs in a series of time-steps in each of which a random choice is made.
 A Markov chain consists of N states, plus an N × N transition probability matrix P.
 state = page
 At each step, we are on exactly one of the pages.



Teleporting – to get us out of dead ends






HITS – Hyperlink-Induced Topic Search

Hubs:  A hub page is a good list of links to pages answering the information need
Authorities:  An authority page is a direct answer to the information need

A good hub page for a topic links to many authority pages for that topic
A good authority page for a topic is linked to by many hub pages for that topic




---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Tuesday, April 4, 2023

Recommender Systems

Long tail effect :  Many popular products are found in both Retail / Online 
                    less popular products are found only online  

---------------------------------
Modelling Recommender Systems 

U = {USERS}
I = {ITEMS}
F is a utility function, measures the usefulness of items I to user U.
F:UxI -> R where R is the rating 

--------------------------------

Characteristics of recommendation system



Types of recommender system 





SVD - Singular value decomposition







---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Sunday, April 2, 2023

Web Crawler

Web Crawler : It is a software for  downloading pages from the Web. Also known as Web Spider, Web Robot,  or simply Bot.

Web Crawler Applications

1.Create an index covering broad topics (General  Web search )
2.Create an index covering specific topics (Vertical  Web search )
3.Archive content :(Web archival, URL: http://www.archive.org/ )
4.Analyze Web sites for extracting aggregate  statistics (Web characterization )
5.Keep copies or replicate Web sites  (Web mirroring-daily or weekly)
6.Web site analysis (broken links, site not available)

Crawler Taxonomy




Basic Web Crawler Architecture
 3 components in Web Crawler

Scheduler - Maintains a queue of URLS to Visit 
downloader - downloads the pages
Storage - makes indexing of pages and provides scheduler with metadata on the pages retrieved


Crawling complications 

1.Malicious Pages - spam pages & spider traps (crawler traps)
2.Non-malicious pages - 
      Latency / Bandwidth  to remote servers vary 
      Webmasters stipulations -- how deep one has to crawl in a website.
3.Site mirrors and duplicate pages 
4.Politness --> how frequently we should hit the server
         Explicit politeness: specifications from  webmasters on what portions of site  can be crawled
                                            robots.txt
        Implicit politeness: even with no  specification, avoid hitting any site too  often
 
--------------------------------------------------------------------------------------------------------------------- 
  
Crawler should be distributed / scalable / performance & efficiency 
                   fetch higher quality pages first
   continuous operation -- fetch fresh copies of previous pages 
   Extensible : Adapt to new data formats.
 
robots.txt --> avoids overloading of the site.

----------------------------------------------------------------------------------------------------------------------
URL frontier





1.The URL frontier is the data structure that holds and  manages URLs we’ve seen, but that have not been  crawled yet.
2.Can include multiple pages from the same host
3.Must avoid trying to fetch them all at the same time
4.Must keep all crawling threads busy

Considerations 

Politeness: do not hit a web server too  frequently
Freshness: crawl some pages more often  than others


-------------------------------------------------------------------------------------------------------------
Basic crawl architecture 


-------------------------------------------------------------------------------------------------------------



---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Friday, March 31, 2023

Websearch

 Search engine

 1. Crawler Based 
 2. Directory 
 3. Metasearch 
 
Search Types

 1. General Search / Horizontal Search : ex : google --> results are very broad and results might not be relevant sometimes.
 2. Vertical search - very specific search or specific part of internet ; ex: google images / Amazon product search.
 



 
 Web challenges for IR 
 ---------------------
 1.Distributed Data: Documents spread over millions of  different web servers.
 2.Volatile Data: Many documents change or disappear  rapidly (e.g. dead links).
 3.Large Volume: Billions of separate documents.
 4.Unstructured and Redundant Data: No uniform  structure, HTML errors, up to 30% (near) duplicate  documents.
 5.Quality of Data: No editorial control, false information,  poor quality writing, typos, etc.
 6.Heterogeneous Data: Multiple media types (images,  video, VRML), languages, character sets, etc.
 
 
Modeling the Web
----------------

Heaps’ and Zipf’s laws are also valid in the Web.
»In particular, the vocabulary grows faster (larger) and the word
distribution should be more biased (larger)

Heaps’ Law
» An empirical rule which describes the vocabulary growth as a  function of the text size.
» It establishes that a text of n words has a vocabulary of size O(nš›½) for
0< š›½ <1

Zipf’s Law
» An empirical rule that describes the frequency of the text words.
» It states that the i-th most frequent word appears as many times as
the most frequent one divided by i š›½, for some š›½ >1


Different types of queries 

1. Informational queries : learn about something - 40%
2. Navigational queries : take to a page - 25%
3. Transactional queries : want to do something - 35%

Essential Characteristics for  user-friendliness of  a website 
1.Mobile Compatibility
2.Accessible to All Users
3.Well Planned Information Architecture
4.Well-Formatted Content That Is Easy to Scan
5.Fast Load Times
6.Browser Consistency
7.Effective Navigation
8.Good Error Handling
9.Contrasting Color Scheme
10.Usable forms


Centralized Architecture - Crawler-Indexer Architecture
important components
1.Crawler / spider 
2.Indexer 
3.Query Engine  




Indexing process
1. text acquisition 
2. text transformation
3. Index creation

Query Process
1.User Interaction
2.Ranking
3.Evaluations



Distributed Architecture - 
Harvest 
  Gathers and Brokers


 
  
User Interface 
 query interface
 Answer interface 


---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Ranked Retrieval Evaluation - Rank Based Measures - Non- Binary relevance

Non- Binary relevance

Documents are rarely entirely relevant or non-relevant  to a query
Many sources of graded relevance judgments
Relevance judgments on a 5-point scale
Multiple judges
Click distribution and deviation from expected levels
(but click-through != relevance judgments)





Normalized Discounted  Cumulative Gain (NDCG)















---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. 
Host/author is not responsible for these posts.

Ranked Retrieval Evaluation - Rank Based Measures - Binary relevance

Binary relevance -->  we are only worried if it is relevant or not 


Precision@K (P@K) -- only top documents ie compute  % relevant documents are picked in K


Practical Example : Practical Example

-----------------------------------------------------------------
Mean Average Precision (MAP): 






-----------------------------------------------------------------------

Mean Reciprocal Rank (MRR)







---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post.Host/author is not responsible for these posts.

Wednesday, March 29, 2023

Unranked Retrieval Evaluation- Precision and Recall based on documents and F-Measure

Unranked Retrival Information
important measures are 

Precision:The ability to retrieve top-ranked documents that are mostly relevant. 

Recall:The ability of the search to find all of the  relevant items in the corpus.




F-measure : 
1.Takes into  account both recall and precision.
2. Harmonic mean of recall and precision










----------------------------------------------------------------------------
 All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post.
Host/author is not responsible for these posts.

TREC Benchmark / Measuring Relevance / Evaluating IR system

Measures for Search Engine
1. Latency of search
2. Expressiveness of query Language 
Ability to express complex information needs
Speed on complex queries

Measuring Releavance based on User 
Three elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. An assessment of either Relevant or  Nonrelevant for each query and each  document

TREC Benchmark - Text REtrieval Conference 





Evaluating IR system -- need to verify if retrived document is relaveant or not

Difficulties in Evaluating IR Systems
1.Effectiveness is related to the relevancy of  retrieved items.
2.Relevancy is not typically binary but continuous.
3.Even if relevancy is binary, it can be a  difficult judgment to make.
Relevancy, from a human standpoint, is:
Subjective: Depends upon a specific user’s  judgment.
Situational: Relates to user’s current needs.
Cognitive: Depends on human perception and  behavior.
Dynamic: Changes over time.
Evaluating IR systems 
1. Gold Standard (Human Labeled Corpora) : Using Humans to create Gold standard - Manual


----------------------------------------------------------------------------
 All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Tuesday, January 24, 2023

Midsemester - Information Retrieval -- DSECLZG537 - Jan 7th 2023

Information Retrieval - Regular - Mid Semester conducted on 7th Jan 2023









---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Wednesday, January 4, 2023

Tf-IDf in information Retrieval

The tf-idf (term frequency-inverse document frequency) is a measure of the importance of a word in a document or a collection of documents. It is commonly used in information retrieval and natural language processing tasks.

The formula for calculating tf-idf is:

tf-idf = tf * idf

where:

  • tf (term frequency) is the frequency of the word in the document. It can be calculated as the number of times the word appears in the document divided by the total number of words in the document.

  • idf (inverse document frequency) is a measure of the rarity of the word. It can be calculated as the logarithm of the total number of documents divided by the number of documents that contain the word.

The resulting tf-idf score for a word reflects both the importance of the word in the specific document and its rarity in the collection of documents. Words that are common across all documents will have a lower tf-idf score, while words that are specific to a particular document and rare in the collection will have a higher tf-idf score.

----------------------------------

The tf-idf (term frequency-inverse document frequency) is a measure of the importance of a word in a document or a collection of documents. It is commonly used in information retrieval and natural language processing tasks.

The formula for calculating tf-idf is:

tf-idf = tf * idf

where:

  • tf (term frequency) is the frequency of the word in the document. It can be calculated as the number of times the word appears in the document divided by the total number of words in the document.

  • idf (inverse document frequency) is a measure of the rarity of the word. It can be calculated as the logarithm of the total number of documents divided by the number of documents that contain the word.

The resulting tf-idf score for a word reflects both the importance of the word in the specific document and its rarity in the collection of documents. Words that are common across all documents will have a lower tf-idf score, while words that are specific to a particular document and rare in the collection will have a higher tf-idf score.

----------------------




---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Jaccard Coefficient - Information Retrieval



Links : Example

----------------------------------------------------------------------------
 All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Minimum Edit Distance - Information Retrieval






Useful links : Link1
                      Execution2Intention Example



---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

K-Gram Index (Bigram Indexes) - Information Retrieval







K Gram example : Stanford

---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Tuesday, January 3, 2023

Permuterm index - Information Retrieval




Link : Stanford Link 

---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Difference between Word/term/token and type

Word – A delimited string of characters as it appears in the  text.

Term – A “normalized” word (case, morphology, spelling  etc); an equivalence class of words.
ex: Same word can be present multiple times, need to consider it all times.

Token – An instance of a word or term occurring in a document.
ex: only time we need to consider how many times the word occurs.

Type – The same as a term in most cases: an equivalence  class of tokens.

---------------------------------------------------------------------------- 
All the messages
below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Issues with Information Retrieval

Issues with Information Retrieval?

Information Retrieval deals with uncertainty and
vagueness in information systems.

Uncertainty: available representation does typically not  reflect true semantics/meaning of objects (text, images,  video, etc.)

Vagueness: information need of user lacks clarity, is only  vague expressed in query, feedback or user actions.

Differs conceptually from database queries!




---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Information Retrieval vs Data Retrieval -- Tabular form

 



---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Sunday, January 1, 2023

Information Retrieval -- DSECLZG537 - Mid Semester Question Paper - June 2021

 

Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division

June 2021

Mid-Semester Test

(EC-1 Regular)

Text Box: No. of Pages        = 2
No. of Questions = 2

 


Course No.                   : SS ZG537  

Course Title                  : INFORMATION RETRIEVAL  

Nature of Exam            : Closed Book

Weightage                    : 30%

 

Note:

1.       Please follow all the Instructions to Candidates given on the cover page of the answer book.

2.       All parts of a question should be answered consecutively. Each answer should start from a fresh page. 

3.       Assumptions made if any, should be stated clearly at the beginning of your answer.

 

Q1 – 2+5+3+5=15 Marks

A) Give an example of uncertainty and vagueness issues in Information retrieval [2 Marks]               

 

B) Explain the merge algorithm for the query “Information Retrieval”? What is the best order for query processing for the query “BITS AND Information AND Retrieval”? What Documents will be returned as output from the 15 documents? [5 Marks]



 


Solution:

Merge Algorithm - Intersecting two posting lists : Algorithm


Output document - 11

 

C) [3 Marks]

 

D) Build inverted index using Blocked sort-based Indexing for 50 million records. Explain the algorithm in detail with respect to indexing 50 million records.                            [5 Marks]

 

 

Q2 – 5+5+5=15 Marks

A)    Assume a corpus of 10000 documents.  The following table gives the TF and DF values for the 3 terms in the corpus of documents. Calculate the logarithmic TF-IDF values.                                                                                                           [5 Marks]

 

Term

Doc1

Doc2

Doc3

bits

15

5

20

pilani

2

20

0

mtech

0

20

15

 

Term

dft

 

bits

2000

pilani

1500

mtech

500

 

 

 

 

 

B) Classify the test document d6 into c1 or c2 using naĆÆve bayes classifier. The documents in the training set and the appropriate class label is given below.  [5 Marks]

                                                                                     

 

 

Docid

Words in document

c= c1

c= c2

Training Set

d1

positive

Yes

No

 

d2

Very positive

Yes

No

 

d3

Positive very positive 

Yes

No

 

d4

very negative

No

Yes

 

d5

negative

No

Yes

Test Set

d6

Negative positive very positive

?

?

 

C) The search engine ranked results on 0-5 relevance scale: 2, 2, 3, 0, 5. Calculate the NDCG metric for the same. [5 Marks]

                                                                                                                      

 





---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.