forwards: March 2023

Friday, March 31, 2023

Websearch

Search engine

1. Crawler Based

2. Directory

3. Metasearch

Search Types

1. General search (horizontal search): e.g. Google. Results are very broad and may sometimes not be relevant.

2. Vertical search: a search restricted to a specific topic or a specific part of the internet, e.g. Google Images or Amazon product search.

Web challenges for IR

---------------------

1.Distributed Data: Documents spread over millions of different web servers.

2.Volatile Data: Many documents change or disappear rapidly (e.g. dead links).

3.Large Volume: Billions of separate documents.

4.Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% (near) duplicate documents.

5.Quality of Data: No editorial control, false information, poor quality writing, typos, etc.

6.Heterogeneous Data: Multiple media types (images, video, VRML), languages, character sets, etc.

Modeling the Web

----------------

Heaps’ and Zipf’s laws are also valid in the Web.

» In particular, on the Web the vocabulary grows faster (a larger Heaps exponent) and the word distribution is more biased (a larger Zipf exponent).

Heaps’ Law

» An empirical rule which describes the vocabulary growth as a function of the text size.

» It establishes that a text of n words has a vocabulary of size O(n^𝛽) for 0 < 𝛽 < 1.
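A minimal sketch of how Heaps' law could be checked on a token stream. The function names are hypothetical, and the exponent is estimated with a plain least-squares fit on the log-log points, which is one simple choice among several.

```python
import math


def vocabulary_growth(tokens):
    """Return (n, V) pairs: after reading n tokens, V distinct words were seen."""
    seen = set()
    points = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        points.append((n, len(seen)))
    return points


def estimate_heaps_exponent(points):
    """Slope of log V against log n, i.e. beta in V ~ K * n^beta."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

On real text the fitted exponent would be expected to fall between 0 and 1; a larger value means the vocabulary keeps growing faster as more text is read.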

Zipf’s Law

» An empirical rule that describes the frequency of the text words.

» It states that the i-th most frequent word appears as many times as the most frequent one divided by i^𝛽, for some 𝛽 > 1 (this exponent is separate from the 𝛽 in Heaps' law).
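A small sketch of Zipf's law as stated above: observed word frequencies are ranked, and the law's prediction for rank i is the top frequency divided by i raised to the exponent. The function names and the sample exponent are illustrative assumptions.

```python
from collections import Counter


def rank_frequencies(tokens):
    """Return (word, count) pairs ordered from most to least frequent."""
    return Counter(tokens).most_common()


def zipf_prediction(top_count, exponent, ranks):
    """Predicted count for ranks 1..ranks: top_count / i**exponent."""
    return [top_count / (i ** exponent) for i in range(1, ranks + 1)]
```

Comparing `rank_frequencies(tokens)` against `zipf_prediction(...)` for the same number of ranks gives a quick visual check of how closely a collection follows the law.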

Different types of queries

1. Informational queries : learn about something - 40%

2. Navigational queries : take to a page - 25%

3. Transactional queries : want to do something - 35%

Essential Characteristics for user-friendliness of a website

1.Mobile Compatibility

2.Accessible to All Users

3.Well Planned Information Architecture

4.Well-Formatted Content That Is Easy to Scan

5.Fast Load Times

6.Browser Consistency

7.Effective Navigation

8.Good Error Handling

9.Contrasting Color Scheme

10.Usable forms

Centralized Architecture - Crawler-Indexer Architecture

important components

1.Crawler / spider

2.Indexer

3.Query Engine

Indexing process

1. text acquisition

2. text transformation

3. Index creation
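The three indexing stages above can be sketched as a toy inverted index. This is only an illustration: the stopword list, the tokenization rule, and the function names are assumptions, and a real indexer would also handle stemming, positions, and compression.

```python
import re

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"the", "a", "an", "of", "from", "and"}


def transform(text):
    """Text transformation: lowercase, tokenize, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_index(documents):
    """Index creation: map each term to the sorted list of document ids containing it.

    `documents` stands in for text acquisition: a dict of doc_id -> raw text,
    as a crawler might hand it over.
    """
    postings = {}
    for doc_id, text in documents.items():
        for term in transform(text):
            postings.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}
```

A query engine would then look up each query term in the returned dictionary and combine the posting lists.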

Query Process

1.User Interaction

2.Ranking

3.Evaluations

Distributed Architecture - Harvest

Gatherers and Brokers

User Interface

query interface

Answer interface

----------------------------------------------------------------------------

All the messages below are just forwarded messages if some one feels hurt about it please add your comments we will remove the post. Host/author is not responsible for these posts.

Ranked Retrieval Evaluation - Rank Based Measures - Non- Binary relevance

Non- Binary relevance

Documents are rarely entirely relevant or non-relevant to a query

Many sources of graded relevance judgments

Relevance judgments on a 5-point scale

Multiple judges

Click distribution and deviation from expected levels

(but click-through != relevance judgments)

Normalized Discounted Cumulative Gain (NDCG)
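A minimal sketch of NDCG for graded relevance, assuming the common formulation in which the gain at rank i is the relevance grade divided by log2(i + 1). Other variants exist (for example using 2^rel - 1 as the gain), and this sketch also assumes every judged document appears in the ranked list, so the ideal ordering is simply the same grades sorted in decreasing order.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    """DCG of the top k results divided by the DCG of the ideal ordering."""
    cutoff = len(relevances) if k is None else k
    ideal = sorted(relevances, reverse=True)[:cutoff]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg(relevances[:cutoff]) / ideal_dcg
```

A perfectly ordered list scores 1.0, and placing a highly relevant document lower in the ranking pulls the score below 1.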

----------------------------------------------------------------------------


Ranked Retrieval Evaluation - Rank Based Measures - Binary relevance

Binary relevance: we only care whether a document is relevant or not.

Precision@K (P@K): consider only the top K retrieved documents and compute the percentage of them that are relevant.
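As a small illustration of P@K, the sketch below takes the ranked results as a list of 0/1 relevance flags (a hypothetical input format) and divides the number of relevant documents in the top K by K.

```python
def precision_at_k(relevance_flags, k):
    """Fraction of the top k results that are relevant.

    relevance_flags: 1 (relevant) or 0 (not relevant) per ranked result.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance_flags[:k]
    return sum(top_k) / k
```

For the ranking `[1, 0, 1, 1, 0]`, P@3 is 2/3 because two of the first three results are relevant, and P@5 is 3/5.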


-----------------------------------------------------------------

Mean Average Precision (MAP):
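A minimal sketch of MAP under binary relevance. It assumes each query's ranked list contains all of that query's relevant documents, so average precision is normalized by the number of relevant results found in the list; the function names and input format are illustrative.

```python
def average_precision(relevance_flags):
    """Mean of the precision values measured at each relevant result's rank."""
    hits = 0
    precisions = []
    for rank, flag in enumerate(relevance_flags, start=1):
        if flag:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """Average of average_precision over several queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

For the ranking `[1, 0, 1]`, precision is 1/1 at the first relevant result and 2/3 at the second, so the average precision is (1 + 2/3) / 2.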

-----------------------------------------------------------------------

Mean Reciprocal Rank (MRR)
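A short sketch of MRR: for each query, take the reciprocal of the rank of the first relevant result (0 if there is none), then average over the queries. The input format (lists of 0/1 flags) is an assumption made for the example.

```python
def reciprocal_rank(relevance_flags):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, flag in enumerate(relevance_flags, start=1):
        if flag:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings):
    """Average reciprocal rank over a collection of queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)
```

With three queries whose first relevant results are at rank 2, rank 1, and nowhere, the MRR is (0.5 + 1 + 0) / 3 = 0.5.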

----------------------------------------------------------------------------


Wednesday, March 29, 2023

Unranked Retrieval Evaluation- Precision and Recall based on documents and F-Measure

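Unranked Retrieval Evaluation

The important measures are:

Precision: The fraction of retrieved documents that are relevant, i.e. the ability to retrieve documents that are mostly relevant.

Recall: The fraction of the relevant documents in the corpus that are retrieved, i.e. the ability of the search to find all of the relevant items.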

F-measure:

1. Takes into account both recall and precision.

2. Is the harmonic mean of recall and precision.
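The three measures can be sketched over sets of document ids. The function name and the use of sets are assumptions for illustration; the F-measure here is the balanced F1, the plain harmonic mean of precision and recall.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for sets of retrieved and relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

If the system retrieves documents {1, 2, 3, 4} and the relevant set is {2, 4, 5}, precision is 2/4, recall is 2/3, and F1 is 4/7.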

----------------------------------------------------------------------------


TREC Benchmark / Measuring Relevance / Evaluating IR system

Measures for Search Engine

1. Latency of search

2. Expressiveness of query Language

Ability to express complex information needs

Speed on complex queries

Measuring Relevance Based on User

Three elements:

1. A benchmark document collection

2. A benchmark suite of queries

3. An assessment of either Relevant or Nonrelevant for each query and each document

TREC Benchmark - Text REtrieval Conference

Evaluating an IR system: we need to verify whether each retrieved document is relevant or not.

Difficulties in Evaluating IR Systems

1.Effectiveness is related to the relevancy of retrieved items.

2.Relevancy is not typically binary but continuous.

3.Even if relevancy is binary, it can be a difficult judgment to make.

Relevancy, from a human standpoint, is:

Subjective: Depends upon a specific user’s judgment.

Situational: Relates to user’s current needs.

Cognitive: Depends on human perception and behavior.

Dynamic: Changes over time.

Evaluating IR systems

1. Gold standard (human-labeled corpora): humans manually create the gold standard.

----------------------------------------------------------------------------


Tuesday, March 28, 2023

Anu's NLP Session on BITS WILP Pre Mid Semester Topic - 2 - DSECLZG525

The video below has the complete numerical explanation for these NLP topics:

1. Grammars and Parsing

2. A Top-Down Parser

3. A Bottom-Up Chart Parser

4. Parsing Techniques

5. Context Free Grammars

6. Probabilistic CKY Parsing of PCFGs

7. Problems with PCFGs

Session 1 video : Session 1

Anu Garg linkedin : https://www.linkedin.com/in/anu-garg-9ab13962

----------------------------------------------------------------------------


Anu's NLP Session on BITS WILP Pre Mid Semester Topic - 1 - DSECLZG525

Below Video has complete numerical explanation for NLP topics like

1.Natural Language Understanding and Generation

2.N-gram Language Modelling

3.Part-of-Speech Tagging

4.Hidden Markov Model Algorithms

5.The Forward Algorithm

6.The Viterbi Algorithm

7.The Forward-Backward Algorithm

8.Maximum Entropy Markov Model

9. Laplace smoothing.

Part 2 Session Link : Session 2

Anu Garg's Linked in : https://www.linkedin.com/in/anu-garg-9ab13962/

----------------------------------------------------------------------------


Thursday, March 9, 2023

INSTALLING HADOOP, HIVE IN WINDOWS & WORKING WITH HDFS -- Ramana (Along with recording)

########-----------INSTALLING HADOOP, HIVE IN WINDOWS & WORKING WITH HDFS -------------- ########

################ ------------ AKULA VENKATA RAMANA ------------ #################

Link to download Java JDK 8: https://www.oracle.com/java/technologies/downloads/#java8 (go for java version 1.8.0_261)

Link to download Hadoop(hadoop - 2.8.0): https://muhammadbilalyar.github.io/blogs/How-to-install-Hadoop-on-Window-10/

HIVE Installation on WINDOWS:

https://www.youtube.com/watch?v=npyRXkMhrgk

Download Hive -

https://archive.apache.org/dist/hive/hive-2.1.0/

Download Derby Metastore-

https://archive.apache.org/dist/db/derby/db-derby-10.12.1.1/

Download hive-site.xml-

https://drive.google.com/file/d/1qqAo7RQfr5Q6O-GTom6Rji3TdufP81zd/view

# How to extract .tar.gz files on Windows 11

Run cmd as Admin (replace the placeholders in angle brackets with your own paths):

tar -xvzf <source file path> -C <destination path>

# Format the HDFS filesystem: Run the following command in the admin command prompt to format the HDFS filesystem:

hdfs namenode -format

# Start Hadoop services: Run the following command in the command prompt to start the Hadoop services: start-dfs & start-yarn

start-dfs

start-yarn

# Verify installation: run the following command in the command prompt to verify that the Hadoop cluster is running: jps

C:\hadoop-2.8.0\sbin>jps

16592 ResourceManager

19296 DataNode

15096 Jps

18776 NameNode

27644 NodeManager

If all the daemons and Jps are printed on the console, Hadoop has been successfully configured in pseudo-distributed mode on Windows.

Pseudo-distributed mode means a single-node Hadoop cluster, where all the Hadoop daemons run on a single machine.

This mode is useful for testing and development purposes when we want to simulate a distributed Hadoop environment on a single machine.

In pseudo-distributed mode, all the Hadoop daemons such as NameNode, DataNode, ResourceManager, and NodeManager run in separate Java Virtual Machines (JVMs) on the same machine. This means that we can use Hadoop commands and APIs to perform the same tasks as we would in a fully distributed cluster.

********************************************** Installation completed *****************************************************************************

What next? Once the Hadoop cluster is running:

# Start Derby: go to the bin directory of derby-10.12.1.1

# Start the Derby network server:

startNetworkServer -h 0.0.0.0

The output should show the server accepting connections on port 1527:

Fri Mar 03 14:51:19 IST 2023 : Security manager installed using the Basic server security policy.

Fri Mar 03 14:51:19 IST 2023 : Apache Derby Network Server - 10.12.1.1 - (1704137) started and ready to accept connections on port 1527

# Start Hive: go to the bin directory of hive-2.1.0

# If running hive (C:\hive-2.1.0\bin>hive) gives the error

# "Missing hadoop installation: C:\hadoop-2.8.0\bin must be set", set HADOOP_HOME first:

set HADOOP_HOME=C:\hadoop-2.8.0

Now start Hive:

C:\hive-2.1.0\bin>hive

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/C:/hive-2.1.0/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/C:/hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.

Connecting to jdbc:hive2://

Connected to: Apache Hive (version 2.1.0)

Driver: Hive JDBC (version 2.1.0)

Transaction isolation: TRANSACTION_REPEATABLE_READ

Beeline version 2.1.0 by Apache Hive

hive>

# Show Databases / Dropping Tables / Dropping Databases etc

show databases;

use bdsassignment;

DROP TABLE ret_table_data;

DROP DATABASE bdsassignment;

# Creating a database in Hive

create database bdsassignment;

# Use that database

use bdsassignment;

# Creating table in Hive

CREATE TABLE ret_table_data (

record_no INT,

invoice STRING,

stockcode STRING,

description STRING,

quantity INT,

invoicedate TIMESTAMP,

price DOUBLE,

customer_id INT,

country STRING

)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

WITH SERDEPROPERTIES (

"separatorChar" = ",",

"quoteChar" = "\"",

"escapeChar" = "\\"

)

STORED AS TEXTFILE

TBLPROPERTIES ("skip.header.line.count"="1");

# Now load the data into the Hive table

load data local inpath 'C:\Assignment -2 2023 BDS DATA SET online_retail_data.csv' into table ret_table_data;

# To check if the data is properly loaded

select * from ret_table_data limit 20;

# For checking the Table description

describe formatted ret_table_data;

# Location where it is stored in hdfs: (check this in hadoop-2.8.0 command)

hdfs://localhost:9001/user/hive/warehouse/bdsassignment.db/ret_table_data

# To check the length (row count) of the data set

select count(*) from ret_table_data;

# 2) Country from which the maximum revenue was collected from sales in the month of March 2010

SELECT country, SUM(quantity*price) AS revenue

FROM (

SELECT country, quantity, price,

substr(invoicedate, 1, 10) AS inv_date

FROM ret_table_data

WHERE ((invoicedate LIKE '%03-2010%') OR (invoicedate LIKE '%03/2010%'))

) t

GROUP BY country

ORDER BY revenue DESC

LIMIT 1;
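For cross-checking the result of the Hive query on a small sample, the same logic can be sketched in plain Python. This is an illustrative equivalent, not part of the original walkthrough: the function name is hypothetical, the column names are taken from the table definition, and the date markers mirror the two LIKE patterns in the query.

```python
from collections import defaultdict


def top_country_by_revenue(rows, month_markers=("03-2010", "03/2010")):
    """Return (country, revenue) with the highest SUM(quantity * price)
    among rows whose invoicedate string contains one of the month markers."""
    revenue = defaultdict(float)
    for row in rows:
        if any(marker in row["invoicedate"] for marker in month_markers):
            revenue[row["country"]] += int(row["quantity"]) * float(row["price"])
    if not revenue:
        return None  # no invoices matched the month
    return max(revenue.items(), key=lambda item: item[1])
```

Each row is expected as a dictionary, for example one produced by `csv.DictReader` over a few lines of the data set with matching column names.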

# To monitor the Job status

http://localhost:8088

http://localhost:50070

# Add the following property to your mapred-site.xml file, replacing the path with the correct path to your Python executable:

<property>

<name>mapreduce.executable</name>

<value>C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe</value>

</property>

# To leave the safemode

hdfs dfsadmin -safemode leave

# Run the MapReduce job in Hadoop using the following command. Make sure the mapper and reducer .py files are present at the specified paths:

hadoop jar C:\hadoop-2.8.0\share\hadoop\tools\lib\hadoop-streaming-2.8.0.jar ^

-files "file:///C:/wordcoutmapper.py,file:///C:/wordcoutreducer.py" ^

-mapper " C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe wordcoutmapper.py" ^

-reducer "C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe wordcoutreducer.py" ^

-input /BDS_Assignment2/word_count_data.txt ^

-output /BDS_Assignment2/result
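The mapper and reducer scripts themselves are not shown in these notes. As a rough sketch of what a Hadoop Streaming word count pair usually looks like, the two functions below implement the map and reduce steps; the function names and the tab-separated "word, count" output format are assumptions, not the contents of the original wordcoutmapper.py and wordcoutreducer.py.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper step: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer step: sum the counts of consecutive lines sharing the same word.

    Hadoop Streaming sorts mapper output by key before the reducer sees it,
    so equal words arrive next to each other.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# In a streaming job each function would live in its own script and be fed
# from standard input, e.g. in the mapper script:
#   import sys
#   for out in map_lines(sys.stdin):
#       print(out)
```

Locally, the pipeline can be imitated with `reduce_lines(sorted(map_lines(lines)))`, where `sorted` stands in for Hadoop's shuffle and sort phase.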

hadoop jar C:\hadoop-2.8.0\share\hadoop\tools\lib\hadoop-streaming-2.8.0.jar ^

-files "file:///C:/Q1mapper.py,file:///C:/Q1reducer.py" ^

-mapper " C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe Q1mapper.py" ^

-reducer "C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe Q1reducer.py" ^

-input hdfs://localhost:9001/user/hive/warehouse/bdsassignment.db/ret_table_data ^

-output /BDS_Assignment2/result

# Command to list the contents of the /BDS_Assignment2/result directory:

hadoop fs -ls /BDS_Assignment2/result

# Once the MapReduce job completes, you can view the output using the following command:

hadoop fs -cat /BDS_Assignment2/result/part-00000

# To delete the directory /BDS_Assignment2/result in HDFS,

hadoop fs -rm -r /BDS_Assignment2/result

# To stop the Hadoop services

stop-dfs

stop-yarn

Ramana video : https://youtu.be/HmBtvWhs2ng

Ramana linkedin : https://www.linkedin.com/in/akula-v-746ab026

*********************************************************** THANK YOU **************************************************************************

----------------------------------------------------------------------------


Friday, March 3, 2023

Install Hive/Derby on top of Hadoop in Windows

To install Hive/Derby on top of Hadoop, you can follow these general steps.

Compatible versions:

Hadoop : 2.8.0

Hive : 2.1.0

Derby : 10.12.1.1

1. Download and install Hadoop: You can download the Hadoop distribution from the Apache Hadoop website (https://hadoop.apache.org/) and follow the installation instructions.

Download: https://archive.apache.org/dist/hadoop/common/hadoop-2.8.0/

file name : hadoop-2.8.0.tar.gz

links to refer : https://muhammadbilalyar.github.io/blogs/How-to-install-Hadoop-on-Window-10/

https://www.youtube.com/watch?v=GfixwKmS8Ro

2. Download and install Hive: You can download the Hive distribution from the Apache Hive website (https://hive.apache.org/) and follow the installation instructions.

Download: https://archive.apache.org/dist/hive/hive-2.1.0/

file name : apache-hive-2.1.0-bin.tar.gz

links : https://www.youtube.com/watch?v=BV1FgKYjc8E

3. Configure Hive to work with Hadoop: In the Hive configuration files, set the location of the Hadoop installation and configure other parameters such as the database location and the Hive metastore.

4. Start the services: On Windows, start the Hadoop services using the start-all.cmd script located in the Hadoop sbin directory, then start the Hive services using the hive command.

5. Download and install Derby: You can download the Derby distribution from the Derby website and follow the installation instructions.

Download: https://db.apache.org/derby/releases/release-10.12.1.1.html

file name : db-derby-10.12.1.1-bin.tar.gz

Quick commands:

http://localhost:50070/dfshealth.html#tab-overview

http://localhost:8088/cluster

C:\hadoop\hadoop-2.8.0\sbin>start-all.cmd

C:\hive\apache-hive-2.1.0-bin\bin>hive

C:\derby\db-derby-10.12.1.1-bin\bin>startNetworkServer -h 0.0.0.0

----------------------------------------------------------------------------
