Friday, March 31, 2023

Web Search

 Search engine types

 1. Crawler-based
 2. Directory
 3. Metasearch
 
Search Types

 1. General search / horizontal search: e.g., Google -- results are very broad and may not always be relevant.
 2. Vertical search: a very specific search, or a search over a specific part of the internet; e.g., Google Images or Amazon product search.
 



 
 Web Challenges for IR 
 ---------------------
 1. Distributed Data: documents spread over millions of different web servers.
 2. Volatile Data: many documents change or disappear rapidly (e.g., dead links).
 3. Large Volume: billions of separate documents.
 4. Unstructured and Redundant Data: no uniform structure, HTML errors, up to 30% (near-)duplicate documents.
 5. Quality of Data: no editorial control, false information, poor-quality writing, typos, etc.
 6. Heterogeneous Data: multiple media types (images, video, VRML), languages, character sets, etc.
 
 
Modeling the Web
----------------

Heaps’ and Zipf’s laws are also valid on the Web.
» In particular, the vocabulary grows faster (a larger β in Heaps’ law) and the word
distribution is more biased (a larger β in Zipf’s law).

Heaps’ Law
» An empirical rule which describes vocabulary growth as a function of the text size.
» It establishes that a text of n words has a vocabulary of size O(n^β), for 0 < β < 1.

Zipf’s Law
» An empirical rule that describes the frequency of the text words.
» It states that the i-th most frequent word appears as many times as
the most frequent one divided by i^β, for some β > 1.
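
A rough Python sketch of how both laws can be checked empirically (the file name corpus.txt and the whitespace tokenization are assumptions for illustration, not part of the notes):

# Illustrative check of Heaps' and Zipf's laws on a plain-text corpus.
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:   # placeholder corpus file
    words = f.read().lower().split()

# Heaps' law: vocabulary size V(n) grows roughly as K * n^beta, with 0 < beta < 1.
seen = set()
for n, w in enumerate(words, start=1):
    seen.add(w)
    if n % 100000 == 0:
        print("text size", n, "-> vocabulary size", len(seen))

# Zipf's law: the i-th most frequent word occurs about f(1) / i^beta times (beta >= 1).
top = Counter(words).most_common(10)
for i, (word, count) in enumerate(top, start=1):
    print(i, word, count, "predicted ~", round(top[0][1] / i))   # beta = 1 for simplicity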


Different types of queries

1. Informational queries: learn about something - 40%
2. Navigational queries: take the user to a specific page - 25%
3. Transactional queries: the user wants to do something - 35%

Essential Characteristics for user-friendliness of a website
1.Mobile Compatibility
2.Accessible to All Users
3.Well Planned Information Architecture
4.Well-Formatted Content That Is Easy to Scan
5.Fast Load Times
6.Browser Consistency
7.Effective Navigation
8.Good Error Handling
9.Contrasting Color Scheme
10.Usable forms


Centralized Architecture - Crawler-Indexer Architecture
Important components:
1. Crawler / spider
2. Indexer
3. Query engine




Indexing Process
1. Text acquisition
2. Text transformation
3. Index creation

Query Process
1. User interaction
2. Ranking
3. Evaluation
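
A toy Python sketch tying the indexing and query processes above together (the two example documents and the plain term-frequency ranking are illustrative assumptions, not how a real engine scores results):

# Toy crawler-indexer pipeline: text transformation -> index creation -> query/ranking.
from collections import defaultdict, Counter

docs = {1: "web search engines crawl the web",
        2: "vertical search is a specific search"}

# Text transformation: lowercase + whitespace tokenization (stopwords/stemming omitted).
tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}

# Index creation: inverted index mapping term -> {doc_id: term frequency}.
index = defaultdict(Counter)
for doc_id, words in tokens.items():
    for w in words:
        index[w][doc_id] += 1

# Query process: score documents by summed term frequency and return a ranked list.
def search(query):
    scores = Counter()
    for term in query.lower().split():
        scores.update(index.get(term, Counter()))
    return scores.most_common()              # ranked (doc_id, score) pairs

print(search("web search"))                  # [(1, 3), (2, 2)]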



Distributed Architecture - Harvest
  Gatherers and Brokers


 
  
User Interface
 Query interface
 Answer interface


---------------------------------------------------------------------------- 

Ranked Retrieval Evaluation - Rank-Based Measures - Non-Binary Relevance

Non-binary relevance

Documents are rarely entirely relevant or non-relevant to a query.
Many sources of graded relevance judgments:
 - Relevance judgments on a 5-point scale
 - Multiple judges
 - Click distribution and deviation from expected levels
   (but click-through != relevance judgments)





Normalized Discounted Cumulative Gain (NDCG)
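
The worked NDCG figures from the slides are not reproduced in this post. As a rough stand-in, one common formulation is DCG@k = rel_1 + sum over i = 2..k of rel_i / log2(i), and NDCG@k = DCG@k / IDCG@k, where IDCG@k is the DCG of the ideal (relevance-sorted) ranking. A small Python sketch (the graded relevance values below are invented):

# Hedged sketch of DCG / NDCG; the graded (0-3) relevance scores are made-up examples.
import math

def dcg(rels):
    # One common formulation: rel_1 + sum_{i>=2} rel_i / log2(i)
    return sum(r if i == 1 else r / math.log2(i) for i, r in enumerate(rels, start=1))

def ndcg(rels, k):
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0

ranked_relevances = [3, 2, 3, 0, 1, 2]          # relevance of results at ranks 1..6
print(round(ndcg(ranked_relevances, k=6), 3))   # ~0.93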















---------------------------------------------------------------------------- 

Ranked Retrieval Evaluation - Rank-Based Measures - Binary Relevance

Binary relevance --> we only care whether a document is relevant or not


Precision@K (P@K) -- consider only the top K documents, i.e., compute the % of relevant documents among the top K
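
A short Python sketch of P@K (the binary relevance labels below are invented, not taken from the lecture example):

# Hedged sketch of Precision@K on a ranked result list with 0/1 relevance judgments.
def precision_at_k(relevances, k):
    # relevances: judgments for the ranked results, position 1 first
    return sum(relevances[:k]) / k

ranking = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(precision_at_k(ranking, 5))   # 3 relevant in the top 5 -> 0.6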


Practical Example

-----------------------------------------------------------------
Mean Average Precision (MAP): 
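
The worked MAP example from the slides is not reproduced here. In brief, average precision (AP) for one query averages P@K at each rank K where a relevant document is retrieved, and MAP is the mean of AP over all queries. A small Python sketch (the per-query relevance lists are invented):

# Hedged sketch of Average Precision and MAP over a set of queries.
def average_precision(relevances, total_relevant):
    hits, precisions = 0, []
    for i, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)          # P@i at each relevant rank
    return sum(precisions) / total_relevant if total_relevant else 0.0

def mean_average_precision(queries):
    # queries: list of (ranked 0/1 relevance list, total relevant docs in the collection)
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

print(mean_average_precision([([1, 0, 1, 0, 1], 3), ([0, 1, 0, 0, 1], 2)]))   # ~0.60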






-----------------------------------------------------------------------

Mean Reciprocal Rank (MRR)
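
MRR is the mean, over queries, of 1 divided by the rank of the first relevant result. A small Python sketch (the rankings are invented):

# Hedged sketch of Mean Reciprocal Rank over several queries.
def reciprocal_rank(relevances):
    for i, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / i                 # 1 / rank of the first relevant result
    return 0.0

rankings = [[0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
print(sum(reciprocal_rank(r) for r in rankings) / len(rankings))   # (1/3 + 1 + 1/2) / 3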







---------------------------------------------------------------------------- 

Wednesday, March 29, 2023

Unranked Retrieval Evaluation - Precision and Recall based on documents, and F-Measure

Unranked retrieval evaluation
Important measures:

Precision: the ability to retrieve top-ranked documents that are mostly relevant.

Recall: the ability of the search to find all of the relevant items in the corpus.




F-measure:
1. Takes into account both recall and precision.
2. Harmonic mean of recall and precision.
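
A small Python sketch of set-based precision, recall and F-measure (the retrieved and relevant document sets below are invented):

# Hedged sketch of precision, recall and F-measure for an unranked result set.
def precision_recall_f(retrieved, relevant, beta=1.0):
    tp = len(retrieved & relevant)                       # relevant docs that were retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f                          # beta = 1 gives the harmonic mean (F1)

retrieved = {1, 2, 3, 4, 5}
relevant = {2, 4, 6, 8}
print(precision_recall_f(retrieved, relevant))           # (0.4, 0.5, 0.444...)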










----------------------------------------------------------------------------

TREC Benchmark / Measuring Relevance / Evaluating IR system

Measures for a Search Engine
1. Latency of search
2. Expressiveness of the query language
   - Ability to express complex information needs
   - Speed on complex queries

Measuring Relevance based on the User
Three elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. An assessment of either Relevant or Non-relevant for each query and each document

TREC Benchmark - Text REtrieval Conference 





Evaluating an IR system -- need to verify whether a retrieved document is relevant or not

Difficulties in Evaluating IR Systems
1. Effectiveness is related to the relevancy of retrieved items.
2. Relevancy is not typically binary but continuous.
3. Even if relevancy is binary, it can be a difficult judgment to make.
Relevancy, from a human standpoint, is:
 - Subjective: depends upon a specific user’s judgment.
 - Situational: relates to the user’s current needs.
 - Cognitive: depends on human perception and behavior.
 - Dynamic: changes over time.
Evaluating IR systems
1. Gold Standard (Human-Labeled Corpora): using humans to manually create a gold standard.


----------------------------------------------------------------------------

Tuesday, March 28, 2023

Anu's NLP Session on BITS WILP Pre Mid Semester Topic - 2 - DSECLZG525



The video below has the complete numerical explanation for these NLP topics:
1. Grammars and Parsing
2. A Top-Down Parser
3. A Bottom-Up Chart Parser
4. Parsing Techniques
5. Context-Free Grammars
6. Probabilistic CKY Parsing of PCFGs
7. Problems with PCFGs




Session 1 video : Session 1 

Anu Garg linkedin : https://www.linkedin.com/in/anu-garg-9ab13962 





---------------------------------------------------------------------------- 


Anu's NLP Session on BITS WILP Pre Mid Semester Topic - 1 - DSECLZG525


The video below has the complete numerical explanation for NLP topics like:
1.Natural Language Understanding and Generation
2.N-gram Language Modelling
3.Part-of-Speech Tagging
4.Hidden Markov Model Algorithms
5.The Forward Algorithm
6.The Viterbi Algorithm 
7.The Forward-Backward Algorithm
8.Maximum Entropy Markov Model
9. Laplace smoothing.



Part 2 Session Link : Session 2












---------------------------------------------------------------------------- 

Thursday, March 9, 2023

INSTALLING HADOOP, HIVE IN WINDOWS & WORKING WITH HDFS -- Ramana (Along with recording)

########-----------INSTALLING HADOOP, HIVE IN WINDOWS & WORKING WITH HDFS -------------- ########
################ ------------ AKULA VENKATA RAMANA ------------ #################

Link to download Java JDK 8: https://www.oracle.com/java/technologies/downloads/#java8 (go for java version 1.8.0_261)
Link to download Hadoop(hadoop - 2.8.0): https://muhammadbilalyar.github.io/blogs/How-to-install-Hadoop-on-Window-10/


HIVE Installation on WINDOWS:
https://www.youtube.com/watch?v=npyRXkMhrgk

Download Hive -
https://archive.apache.org/dist/hive/hive-2.1.0/

Download Derby Metastore-
https://archive.apache.org/dist/db/derby/db-derby-10.12.1.1/

Download hive-site.xml-
https://drive.google.com/file/d/1qqAo7RQfr5Q6O-GTom6Rji3TdufP81zd/view


# How to extract .tar.gz files on Windows 11
Run cmd as Admin:
tar -xvzf <source-file-path> -C <destination-path>


# Format the HDFS filesystem: Run the following command in the admin command prompt to format the HDFS filesystem: 
hdfs namenode -format

# Start Hadoop services: Run the following command in the command prompt to start the Hadoop services: start-dfs & start-yarn
start-dfs
start-yarn

# Verify installation: run the following command in the command prompt to verify that the Hadoop cluster is running: jps
C:\hadoop-2.8.0\sbin>jps
16592 ResourceManager
19296 DataNode
15096 Jps
18776 NameNode
27644 NodeManager

If all the daemons and Jps are printed on the console, you have successfully configured Hadoop in pseudo-distributed mode on Windows.
Pseudo-distributed mode means: in Hadoop, pseudo-distributed mode is a single-node Hadoop cluster, where all the Hadoop daemons run on a single machine.
This mode is useful for testing and development purposes when we want to simulate a distributed Hadoop environment on a single machine.

In pseudo-distributed mode, all the Hadoop daemons such as NameNode, DataNode, ResourceManager, and NodeManager run in separate Java Virtual Machines (JVMs) 
on the same machine. This means that we can use Hadoop commands and APIs to perform the same tasks as we would in a fully distributed mode cluster.

********************************************** Installation completed *****************************************************************************

What next? Once the Hadoop cluster is running.....

# Start Derby > bin
derby-10.12.1.1

# Start the port:
startNetworkServer -h 0.0.0.0    

You should see the following messages confirming that the server is accepting connections on port 1527:
Fri Mar 03 14:51:19 IST 2023 : Security manager installed using the Basic server security policy.
Fri Mar 03 14:51:19 IST 2023 : Apache Derby Network Server - 10.12.1.1 - (1704137) started and ready to accept connections on port 1527

# Start Hive: hive-2.1.0

# If you get this error: C:\hive-2.1.0\bin>hive
"Missing hadoop installation: C:\hadoop-2.8.0\bin must be set" then go to hadoop directory and set the below path:

set HADOOP_HOME=C:\hadoop-2.8.0

Now start Hive:


C:\hive-2.1.0\bin>hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/hive-2.1.0/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.1.0)
Driver: Hive JDBC (version 2.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.1.0 by Apache Hive
hive> 


# Show Databases / Dropping Tables / Dropping Databases etc
show databases;
use bdsassignment;
DROP TABLE ret_table_data;
DROP DATABASE bdsassignment;


# Creating a data base in Hive
create database bdsassignment;

# Use that database
use bdsassignment;

# Creating table in Hive
CREATE TABLE ret_table_data (
  record_no INT,
    invoice STRING,
    stockcode STRING,
    description STRING,
    quantity INT,
    invoicedate TIMESTAMP,
    price DOUBLE,
    customer_id INT,
    country STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\"",
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");



# Now load the data into the Hive table
load data local inpath 'C:\Assignment -2  2023 BDS DATA SET online_retail_data.csv' into table ret_table_data;


# To check if the data is properly loaded
select * from ret_table_data limit 20;


# For checking the Table description
describe formatted ret_table_data;


# Location where it is stored in hdfs: (check this in hadoop-2.8.0 command)
hdfs://localhost:9001/user/hive/warehouse/bdsassignment.db/ret_table_data

# To check the length (row count) of the data set
select count(*) from ret_table_data;

# 2) Country from which the maximum revenue was collected from sales in the month of March 2010

SELECT country, SUM(quantity*price) AS revenue
FROM (
  SELECT country, quantity, price,
         substr(invoicedate, 1, 10) AS inv_date
  FROM ret_table_data
  WHERE ((invoicedate LIKE '%03-2010%') OR (invoicedate LIKE '%03/2010%'))
) t
GROUP BY country
ORDER BY revenue DESC
LIMIT 1;

# To monitor the Job status
http://localhost:8088

http://localhost:50070


# add the following line to your mapred-site.xml file, replacing the path with the correct path to your Python executable:
<property>
  <name>mapreduce.executable</name>
  <value>C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe</value>
</property>

# To leave the safemode
hdfs dfsadmin -safemode leave

# Run the MapReduce job using the following command: ensure to have py files of mapper & reducer on specified path:

# Now run the below command in Hadoop:

 hadoop jar C:\hadoop-2.8.0\share\hadoop\tools\lib\hadoop-streaming-2.8.0.jar ^
-files "file:///C:/wordcoutmapper.py,file:///C:/wordcoutreducer.py" ^
-mapper " C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe wordcoutmapper.py" ^
-reducer "C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe wordcoutreducer.py" ^
-input /BDS_Assignment2/word_count_data.txt ^
-output /BDS_Assignment2/result

 hadoop jar C:\hadoop-2.8.0\share\hadoop\tools\lib\hadoop-streaming-2.8.0.jar ^
-files "file:///C:/Q1mapper.py,file:///C:/Q1reducer.py" ^
-mapper " C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe Q1mapper.py" ^
-reducer "C:\Users\venka\anaconda3\pkgs\python-3.9.12-h6244533_0\python.exe Q1reducer.py" ^
-input hdfs://localhost:9001/user/hive/warehouse/bdsassignment.db/ret_table_data ^
-output /BDS_Assignment2/result
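
The first streaming job above references wordcoutmapper.py and wordcoutreducer.py (the second references Q1mapper.py / Q1reducer.py); none of these files are included in the post. The sketch below is only a guess at what such Hadoop Streaming word-count scripts typically look like; the author's actual files may differ.

# wordcoutmapper.py (assumed contents)
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")                      # emit word<TAB>1 for every token

# wordcoutreducer.py (assumed contents; input arrives sorted by key from the shuffle)
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")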

# Command to list the contents of the /BDS_Assignment2/result directory:
hadoop fs -ls /BDS_Assignment2/result

# Once the MapReduce job completes, you can view the output using the following command:
hadoop fs -cat /BDS_Assignment2/result/part-00000



# To delete the directory /BDS_Assignment2/result in HDFS,
hadoop fs -rm -r /BDS_Assignment2/result

stop-dfs
stop-yarn




*********************************************************** THANK YOU **************************************************************************



---------------------------------------------------------------------------- 

Friday, March 3, 2023

Install Hive/Derby on top of Hadoop in Windows


To install Hive/Derby on top of Hadoop, you can follow these general steps:

Versions which are compatible are 
Hadoop : 2.8.0
Hive : 2.1.0
Derby : 10.12.1.1

1. Download and install Hadoop: You can download the Hadoop distribution from the Apache Hadoop website (https://hadoop.apache.org/)
and follow the installation instructions.

file name : hadoop-2.8.0.tar.gz


2. Download and install Hive: You can download the Hive distribution from the Apache Hive website (https://hive.apache.org/)
and follow the installation instructions.
file name : apache-hive-2.1.0-bin.tar.gz


3. Configure Hive to work with Hadoop: In the Hive configuration files, you need to set the location of the Hadoop
installation and configure other parameters such as the database location and the Hive metastore.
4. Start the Hadoop services: Start the Hadoop services using the start-all.cmd script (start-all.sh on Linux) located in the Hadoop sbin directory.
5. Start the Hive services: Start the Hive services using the hive command.
6. Download and install Derby: You can download the Derby distribution from the Derby website
and follow the installation instructions.
file name : db-derby-10.12.1.1-bin.tar.gz

Quick commands:

http://localhost:50070/dfshealth.html#tab-overview
http://localhost:8088/cluster
C:\hadoop\hadoop-2.8.0\sbin>start-all.cmd
C:\hive\apache-hive-2.1.0-bin\bin>hive
C:\derby\db-derby-10.12.1.1-bin\bin>startNetworkServer -h 0.0.0.0



---------------------------------------------------------------------------- 

Saturday, February 18, 2023

Makeup - Mid Semester- Questions and answers - Deep Learning - DSECLZG524 - Jan 2023










---------------------------------------------------------------------------- 

Regular - Mid Semester - Deep Learning- questions and answers - DSECLZG524 - 7th Jan 2023










Midsemester - Makeup - SPA - Question Paper with answers - Jan 2023

Birla Institute of Technology & Science, Pilani

Work Integrated Learning Programmes Division

First Semester 2022-2023

 

Mid-Semester Test

(EC-2 Makeup – ANSWER KEY)

 

Course No.                  : DSECL ZC556

Course Title                : Stream Processing and Analytics

Nature of Exam           : Open  Book

No. of Pages        =  4

No. of Questions =  5

Weightage                   : 30%

Duration                      : 2 Hours 

Date of Exam              : 06/03/2021 or 19/03/2021  (FN/AN)

Note to Students:

1.       Please follow all the Instructions to Candidates given on the cover page of the answer book.

2.       All parts of a question should be answered consecutively. Each answer should start from a fresh page. 

3.       Assumptions made if any, should be stated clearly at the beginning of your answer.

 

Q1. Consider an online food ordering and delivery platform which enables customers to browse nearby serving restaurants at any time, explore their menu options, order food and get it delivered to their doorstep, and also provide ratings for the food. The platform also enables restaurant owners to analyze the click-stream as well as historical data related to their restaurant so that they can improve their decisions with respect to the promotional offers made on the platform.                                                                                                           [1 + 1 + 2 + 1 + 1 = 6]

 

a)      Identify the different data sources involved in this scenario. Also label them as internal or external.

b)      If you have been asked to design the customer data stream in this scenario, what would it look like?

c)      Give two examples of exploratory data analysis that restaurant owners can perform with this data. Give details.

d)      Recommend any suitable machine learning technique (with adequate justification) that can help the restaurant owners improve their decisions about marketing efforts.

e)      Which type of system architecture will be useful in this scenario?

 

Answer:

a)      Customer profile data, Restaurant data, Orders database, Food delivery agents data

All of them are internal to the system

b)      It will be a mix of customer + restaurant + order data

{“timestamp”, “cust_id”, “rest_id”, “order_id”, “items”, “payment_mode”…..}

c)      Some examples -

Find out factors affecting the ratings of the restaurant

Find out the relationship between the items that are ordered together

d)      A few examples –

Clustering – find customers who are similar in terms of orders, location, etc.

Regression – predict the customer rating for this order or restaurant, or determine when the customer will place their next order, or estimate the next order’s price

Classification – predict whether a customer will place an order in the next n days or not, or identify customers to whom an offer / discount should be made

Recommendation – recommend menu items to the customers

e)      Lambda – historical data for prediction, real time data for ordering and tracking etc.

 

 

Q2. Imagine that you are building a real time traffic routing system. Identify and justify the communication mechanisms used while making data available to the external world in the following use cases.                                                                                                                       [6]

 

a)      People driving around the city can use a mobile application to get updates and be re-routed based on up-to-the-moment traffic conditions. The user registers interest in a particular route through the mobile application, and in turn the system sends updates about the traffic conditions on that route.

b)      The police department wants to host a web application with a real-time dashboard that continuously shows the traffic conditions at several points in the city, so that appropriate action can be taken to resolve traffic congestion based on the data feeds that come from your streaming system.

c)      A device is fitted into the commercial passenger vehicles roaming around the city. This device can show the current traffic conditions on its display and send the current coordinates of the vehicle and other relevant information back to the streaming system.

Answer:

a)      Webhooks

User registers interest and the application sends the data back to the mobile application

b)      Server sent events (SSE)

The connection between the dashboard and the system remains open, and data is pushed to the client to update the dashboard

c)      WebSockets

Enables two-way communication between the device and the streaming system

For each sub-question: identification of the technique – 1 mark, adequate justification – 1 mark

 

Q3. Consider a banking system that has 4000 ATMs situated across many places in a particular region. A bank customer can carry out many transactions like balance check, money withdrawal and deposit, and password reset. These transactions take 4, 8 and 6 bytes respectively. Each of these transactions has a timestamp attached to it which takes 8 bytes. This transaction data needs to be stored at different places: at the ATMs, in an immediate buffer on the streaming-system side, and in persistent storage. Describe the three use-cases which require the data to be stored at these three places, estimate the sizing required for each, and suggest a suitable storage option.            [6]

 

Answer:

a)      At ATM –

Perhaps just the transaction data from the last 30 minutes at that ATM – if something goes wrong with a transaction, like a wrong password entered again and again for the same customer, immediately raise an alarm to check whether the person attempting the transaction is really the card owner; local processing at the ATM is needed

In-memory databases can be used, but sizing will vary from machine to machine and can be better estimated based on historical transaction data for that machine

 

b)      At the buffer – immediate trend analysis with respect to money-withdrawal patterns, used to keep the ATMs stocked with cash most of the time

Perhaps a day’s worth of transaction data needs to be persisted in the buffer

Data flow tiers such as Kafka, Flume can be helpful

For storing processed data, caching systems can be useful

 

c)      At the database – for historical data analysis; both (a) and (b) will benefit from this sort of permanent storage, such as databases or data stores

Sizing will depend on the business’s view of how much recent data needs to be taken into consideration.

For each correct identification – 1 mark and explanation 1 mark

 

Q4. Answer the following questions in brief:                                                                   [2 *3 = 6]

 

a) How does Apache Flume blur the boundary between data motion and processing?

Answer:

Interceptors are where Flume begins to blur the line between data motion and processing. The interceptor model allows Flume not only to modify the metadata headers of an event, but also to filter events according to those headers.

 

b) Mention two important ways by which processed streaming data can be made available to end users.

Answer:

1) Through dashboards - by sending processed data to the visualizations placed on dashboards

2) Through alerts/notifications - by sending important/exception updates to the users through alerts or other notification channels

 

c)                  “Apache Kafka adopts a prescriptive order for reading and writing operations for a topic”. Justify this statement mentioning whether it is true or false. 

Answer:

True. Apache Kafka is not a queuing system like ActiveMQ. It does not follow the semantics that messages get processed as they arrive. Kafka’s partitioning system does not maintain such a structure.

 

Q5. Consider the following block diagram of a data flow system based on Apache Flume. [1 + 4 + 1 = 6]

a)      Identify the essential components from the perspective of the Flume Agent.

b)      Provide suitable configuration details for an Apache Flume Agent that matches this data flow scenario.

c)      Which type of data flow is represented by this block diagram?

Answer:

a)       

Three essential components of a Flume Agent:

·         Sources – Avro source, Thrift source and Syslog source for port monitoring data

·         Channels – Memory channel

·         Sink – HDFS sink, ElasticSearch sink

Identification of sources, channels and sinks – 1 mark

b)

The agent configuration should declare the three sources, the channel and the sinks:

myAgent.sources = myAvroSource, myThriftSource, mySyslogSource
myAgent.channels = myMemoryChannel
myAgent.sinks = myHDFSSink, myESSink

myAgent.sources.myAvroSource.type = avro
myAgent.sources.myThriftSource.type = thrift
myAgent.sources.mySyslogSource.type = syslogtcp

myAgent.sinks.myHDFSSink.type = hdfs
myAgent.sinks.myESSink.type = elasticsearch

myAgent.channels.myMemoryChannel.type = memory

myAgent.sources.myAvroSource.channels = myMemoryChannel
myAgent.sources.myThriftSource.channels = myMemoryChannel
myAgent.sources.mySyslogSource.channels = myMemoryChannel
myAgent.sinks.myHDFSSink.channel = myMemoryChannel
myAgent.sinks.myESSink.channel = myMemoryChannel

 

Definition of source, sink and channel – 0.5 mark

Configuration of source – 1 mark

Configuration of channel – 0.5 mark

Configuration of sink – 1 mark

Mapping between source/sink and channel – 1 mark

 

 

c)      Fan-in flow – from multiple sources to a single channel

 

********************



---------------------------------------------------------------------------- 