Showing posts with label Search. Show all posts

Tuesday, April 4, 2023

Recommender Systems

Long tail effect: Popular products are available both in physical retail stores and online;
                  less popular (niche) products are found only online.

---------------------------------
Modelling Recommender Systems 

U = {USERS}
I = {ITEMS}
F is a utility function that measures the usefulness of an item in I to a user in U.
F: U x I -> R, where R is the set of possible ratings
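
The utility function above can be pictured as a lookup over (user, item) pairs. The following is only a minimal sketch: the users, items, rating values and the helper names `utility` and `unrated_items` are made up for illustration. A real recommender has to predict the missing values rather than just look them up.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical toy data: the known values of F for some (user, item) pairs.
ratings: Dict[Tuple[str, str], float] = {
    ("alice", "book"): 5.0,
    ("alice", "film"): 3.0,
    ("bob", "book"): 4.0,
}

def utility(user: str, item: str) -> Optional[float]:
    """Return F(user, item), or None when the rating is unknown."""
    return ratings.get((user, item))

def unrated_items(user: str, items: Set[str]) -> Set[str]:
    """Items for which F(user, item) still has to be predicted."""
    return {item for item in items if (user, item) not in ratings}
```

For example, `utility("bob", "film")` is unknown here, so "film" is a candidate to score and possibly recommend to bob.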

--------------------------------

Characteristics of recommendation system



Types of recommender system 





SVD - Singular value decomposition







---------------------------------------------------------------------------- 
All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Sunday, April 2, 2023

Web Crawler

Web Crawler: Software for downloading pages from the Web. Also known as a Web spider, Web robot, or simply bot.

Web Crawler Applications

1. Create an index covering broad topics (general Web search)
2. Create an index covering specific topics (vertical Web search)
3. Archive content (Web archival, URL: http://www.archive.org/)
4. Analyze Web sites to extract aggregate statistics (Web characterization)
5. Keep copies of or replicate Web sites (Web mirroring, daily or weekly)
6. Web site analysis (broken links, site not available)

Crawler Taxonomy




Basic Web Crawler Architecture
 A Web crawler has 3 components:

Scheduler - maintains a queue of URLs to visit
Downloader - downloads the pages
Storage - indexes the pages and provides the scheduler with metadata on the pages retrieved
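
The interplay of the three components can be sketched as a small loop. This is an illustrative sketch only: `FAKE_WEB` is a made-up in-memory stand-in for the Web so the example runs without network access, and the names `download` and `crawl` are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB: Dict[str, Tuple[str, List[str]]] = {
    "http://a.example/": ("home", ["http://a.example/1", "http://b.example/"]),
    "http://a.example/1": ("page one", ["http://a.example/"]),
    "http://b.example/": ("other site", []),
}

def download(url: str) -> Tuple[str, List[str]]:
    """Downloader: fetch a page. Here it only reads from FAKE_WEB."""
    return FAKE_WEB.get(url, ("", []))

def crawl(seed: str, max_pages: int = 10) -> Dict[str, str]:
    queue = deque([seed])          # Scheduler: URLs waiting to be visited
    seen = {seed}                  # URLs already scheduled, to avoid repeats
    storage: Dict[str, str] = {}   # Storage: retrieved pages keyed by URL
    while queue and len(storage) < max_pages:
        url = queue.popleft()
        text, links = download(url)
        storage[url] = text
        for link in links:         # newly found links go back to the scheduler
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return storage
```

Calling `crawl("http://a.example/")` visits the pages breadth-first, because the scheduler here is a plain FIFO queue. A real scheduler would also apply priorities and politeness rules.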


Crawling complications 

1. Malicious pages - spam pages and spider traps (crawler traps)
2. Non-malicious pages -
      Latency / bandwidth to remote servers varies
      Webmaster stipulations - how deep a crawler may go within a website
3. Site mirrors and duplicate pages
4. Politeness - how frequently the crawler may hit a server
      Explicit politeness: specifications from webmasters on which portions of the site can be crawled (robots.txt)
      Implicit politeness: even with no specification, avoid hitting any site too often
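
Explicit politeness can be checked with Python's standard `urllib.robotparser` module. The sketch below parses a made-up robots.txt from a string instead of fetching one. The file contents, the `ExampleBot` user agent and the `allowed` helper are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a webmaster might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from text, no network access

def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """True if the parsed robots.txt lets this agent fetch the URL."""
    return parser.can_fetch(agent, url)

# Seconds the site asks crawlers to wait between requests (None if unset).
delay = parser.crawl_delay("ExampleBot")
```

Implicit politeness still applies when a site publishes no Crawl-delay: the crawler should pick a conservative delay of its own.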
 
--------------------------------------------------------------------------------------------------------------------- 
  
A crawler should be:
   Distributed and scalable, with good performance and efficiency
   Quality-aware: fetch higher quality pages first
   Continuous in operation: fetch fresh copies of previously crawled pages
   Extensible: adapt to new data formats

robots.txt --> tells crawlers which parts of a site they may fetch, which also helps avoid overloading the site.

----------------------------------------------------------------------------------------------------------------------
URL frontier





1. The URL frontier is the data structure that holds and manages URLs we’ve seen, but that have not been crawled yet.
2. It can include multiple pages from the same host.
3. It must avoid trying to fetch them all at the same time.
4. It must keep all crawling threads busy.

Considerations 

Politeness: do not hit a web server too frequently
Freshness: crawl some pages more often than others
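
One possible way to meet the politeness requirement is a queue per host plus a heap of the times at which each host may be hit again. This is a sketch under assumptions, not the architecture from the notes: the class name `Frontier`, the fixed `delay` and the explicit `now` argument (a simulated clock) are all illustrative, and freshness is not modeled.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlparse

class Frontier:
    """URL frontier that never returns two URLs of one host within `delay`."""

    def __init__(self, delay: float = 2.0):
        self.delay = delay
        self.queues = defaultdict(deque)  # host -> URLs seen but not crawled
        self.next_time = {}               # host -> earliest allowed fetch time
        self.ready = []                   # heap of (fetch time, host)
        self.seen = set()

    def add(self, url: str, now: float = 0.0) -> None:
        if url in self.seen:
            return
        self.seen.add(url)
        host = urlparse(url).netloc
        if not self.queues[host]:         # host is not on the heap yet
            when = max(now, self.next_time.get(host, 0.0))
            heapq.heappush(self.ready, (when, host))
        self.queues[host].append(url)

    def next_url(self, now: float):
        """Return a URL that may be fetched at time `now`, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].popleft()
        self.next_time[host] = now + self.delay
        if self.queues[host]:             # reschedule host after the delay
            heapq.heappush(self.ready, (self.next_time[host], host))
        return url
```

A crawling thread that gets None can sleep briefly and retry. With many hosts in the frontier, some host is usually ready, which is what keeps the threads busy.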


-------------------------------------------------------------------------------------------------------------
Basic crawl architecture 


-------------------------------------------------------------------------------------------------------------




Friday, March 31, 2023

Websearch

 Types of search engine

 1. Crawler Based 
 2. Directory 
 3. Metasearch 
 
Search Types

 1. General search / horizontal search: e.g. Google. Results are very broad and may not always be relevant.
 2. Vertical search: restricted to a specific topic or a specific part of the internet, e.g. Google Images, Amazon product search.
 



 
 Web challenges for IR 
 ---------------------
 1. Distributed Data: Documents spread over millions of different web servers.
 2. Volatile Data: Many documents change or disappear rapidly (e.g. dead links).
 3. Large Volume: Billions of separate documents.
 4. Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% (near) duplicate documents.
 5. Quality of Data: No editorial control, false information, poor quality writing, typos, etc.
 6. Heterogeneous Data: Multiple media types (images, video, VRML), languages, character sets, etc.
 
 
Modeling the Web
----------------

Heaps’ and Zipf’s laws are also valid in the Web.
» In particular, the vocabulary grows faster (larger Heaps exponent) and the word distribution is more biased (larger Zipf exponent)

Heaps’ Law
» An empirical rule which describes the vocabulary growth as a function of the text size.
» It establishes that a text of n words has a vocabulary of size O(n^β) for 0 < β < 1

Zipf’s Law
» An empirical rule that describes the frequency of the text words.
» It states that the i-th most frequent word appears as many times as the most frequent one divided by i^β, for some β > 1
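
Both laws are simple power laws, so a few lines of arithmetic show how they behave. The constant `k` and the exponent values below are assumptions chosen only to give round numbers. They are not measurements from any real collection.

```python
def heaps_vocabulary(n: int, k: float = 10.0, beta: float = 0.5) -> float:
    """Heaps' law estimate of vocabulary size for a text of n words.

    k is a hypothetical proportionality constant hidden inside O(n^beta).
    """
    return k * n ** beta

def zipf_frequency(rank: int, top_frequency: float, beta: float = 1.5) -> float:
    """Zipf's law estimate of how often the rank-th most frequent word occurs."""
    return top_frequency / rank ** beta
```

With these assumed values, a text of 10,000 words has an estimated vocabulary of about 1,000 words. If the most frequent word occurs 800 times, the 4th most frequent word is estimated at about 100 occurrences.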


Different types of queries 

1. Informational queries : learn about something - 40%
2. Navigational queries : take to a page - 25%
3. Transactional queries : want to do something - 35%

Essential characteristics for user-friendliness of a website
1. Mobile Compatibility
2. Accessible to All Users
3. Well Planned Information Architecture
4. Well-Formatted Content That Is Easy to Scan
5. Fast Load Times
6. Browser Consistency
7. Effective Navigation
8. Good Error Handling
9. Contrasting Color Scheme
10. Usable Forms


Centralized Architecture - Crawler-Indexer Architecture
important components
1.Crawler / spider 
2.Indexer 
3.Query Engine  




Indexing process
1. Text acquisition
2. Text transformation
3. Index creation

Query process
1. User interaction
2. Ranking
3. Evaluation
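
The indexing and query steps above can be sketched with a tiny inverted index. This is an illustrative sketch, not a description of any particular engine: text acquisition is assumed to have already produced the `docs` dictionary, ranking is a plain sum of term frequencies, and evaluation is left out. The function names are hypothetical.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List

def tokenize(text: str) -> List[str]:
    """Text transformation: lowercase and split into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Index creation: map each term to {doc_id: term frequency}."""
    index: Dict[str, Dict[str, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

def search(index: Dict[str, Dict[str, int]], query: str) -> List[str]:
    """Ranking: order documents by the summed frequency of the query terms."""
    scores: Counter = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties broken by document id for a stable order.
    return [d for d, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]
```

The user interaction step corresponds to whatever front end passes the query string to `search` and displays the returned document ids.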



Distributed Architecture - 
Harvest 
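  Gatherers and Brokers
  (gatherers collect and extract indexing information from Web servers; brokers provide the indexing mechanism and the query interface)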


 
  
User Interface
 Query interface
 Answer interface

