Sunday, January 1, 2023

Information Retrieval -- DSECLZG537 - Mid Semester Question Paper - June 2021

 

Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division

June 2021

Mid-Semester Test

(EC-1 Regular)

No. of Pages = 2
No. of Questions = 2

 


Course No.                   : SS ZG537  

Course Title                  : INFORMATION RETRIEVAL  

Nature of Exam            : Closed Book

Weightage                    : 30%

 

Note:

1. Please follow all the Instructions to Candidates given on the cover page of the answer book.

2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.

3. Assumptions made, if any, should be stated clearly at the beginning of your answer.

 

Q1 – 2+5+3+5=15 Marks

A) Give an example of uncertainty and vagueness issues in Information retrieval [2 Marks]               

 

B) Explain the merge algorithm for the query “Information Retrieval”. What is the best order of query processing for the query “BITS AND Information AND Retrieval”? Which documents will be returned as output from the 15 documents? [5 Marks]



 


Solution:

Merge Algorithm - Intersecting two posting lists : Algorithm


Output document - 11
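
The merge step can be sketched in Python as below. This is a minimal illustration, not the official solution: the postings lists are invented for the example (the ones from the question paper are not reproduced in this post), and the function names `intersect` and `intersect_terms` are hypothetical. For an AND query with several terms, a common heuristic is to start with the term that has the shortest postings list, so intermediate results stay small.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists of doc IDs with a linear merge."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Doc ID present in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller doc ID
        else:
            j += 1
    return answer


def intersect_terms(index, terms):
    """AND several terms, processing them in order of increasing list length."""
    ordered = sorted(terms, key=lambda t: len(index[t]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        result = intersect(result, index[term])
    return result


# Hypothetical postings lists (NOT the ones from the question paper).
index = {
    "bits": [2, 5, 11, 14],
    "information": [1, 2, 3, 5, 8, 11, 13, 15],
    "retrieval": [1, 3, 4, 7, 9, 11, 12],
}
print(intersect_terms(index, ["bits", "information", "retrieval"]))
```

The merge runs in time proportional to the sum of the two list lengths, which is why both lists must be kept sorted by doc ID.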

 

C) [3 Marks]

 

D) Build an inverted index using blocked sort-based indexing (BSBI) for 50 million records. Explain the algorithm in detail with respect to indexing 50 million records. [5 Marks]
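
As a rough aid for this question, the idea behind blocked sort-based indexing can be sketched as follows. This is a simplified, in-memory illustration under stated assumptions: the function name `bsbi_index` and the `block_size` parameter are hypothetical, terms are used directly instead of term IDs, and the sorted blocks are held in Python lists, whereas a real system for 50 million records would write each sorted block to disk and stream the blocks back during the merge.

```python
import heapq
from itertools import groupby


def bsbi_index(docs, block_size):
    """Build an inverted index from (doc_id, text) pairs, block by block.

    block_size is the maximum number of (term, doc_id) pairs collected
    before a block is sorted and set aside.
    """
    blocks, pairs = [], []
    for doc_id, text in docs:
        for term in text.lower().split():
            pairs.append((term, doc_id))
            if len(pairs) >= block_size:
                # Step 1: sort one block in memory (a real system writes it to disk).
                blocks.append(sorted(set(pairs)))
                pairs = []
    if pairs:
        blocks.append(sorted(set(pairs)))

    # Step 2: merge all sorted blocks into one sorted stream of pairs.
    index = {}
    for term, group in groupby(heapq.merge(*blocks), key=lambda p: p[0]):
        postings = []
        for _, doc_id in group:
            # Skip duplicates that came from different blocks.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
        index[term] = postings
    return index


docs = [(1, "bits pilani"), (2, "bits mtech"), (3, "pilani bits bits")]
print(bsbi_index(docs, block_size=2))
```

The block size would in practice be chosen so that one block fits comfortably in main memory.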

 

 

Q2 – 5+5+5=15 Marks

A) Assume a corpus of 10000 documents. The following tables give the TF and DF values for 3 terms in the corpus of documents. Calculate the logarithmic TF-IDF values. [5 Marks]

 

Term      Doc1   Doc2   Doc3
bits      15     5      20
pilani    2      20     0
mtech     0      20     15

 

Term      dft
bits      2000
pilani    1500
mtech     500
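
One way to work this out is sketched below. It assumes the common log-weighting scheme, w = (1 + log10 tf) x log10(N / df) with w = 0 when tf = 0; the question does not state which variant is expected, so treat the formula choice and the helper names `log_tf` and `idf` as assumptions.

```python
import math

N = 10_000  # corpus size given in the question

# Term frequencies per document (Doc1, Doc2, Doc3) and document frequencies.
tf = {"bits": [15, 5, 20], "pilani": [2, 20, 0], "mtech": [0, 20, 15]}
df = {"bits": 2000, "pilani": 1500, "mtech": 500}


def log_tf(count):
    """Log-scaled term frequency; zero when the term is absent."""
    return 1 + math.log10(count) if count > 0 else 0.0


def idf(doc_freq, n_docs=N):
    """Inverse document frequency, base-10 logarithm."""
    return math.log10(n_docs / doc_freq)


tfidf = {
    term: [log_tf(c) * idf(df[term]) for c in counts]
    for term, counts in tf.items()
}
for term, weights in tfidf.items():
    print(term, [round(w, 3) for w in weights])
```

Note that the rarest term (mtech) gets the largest IDF, so it dominates the weights in the documents where it occurs.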

 

 

 

 

 

B) Classify the test document d6 into c1 or c2 using a Naïve Bayes classifier. The documents in the training set and their class labels are given below. [5 Marks]

                                                                                     

 

 

               Docid   Words in document                 c = c1   c = c2
Training Set   d1      positive                          Yes      No
               d2      Very positive                     Yes      No
               d3      Positive very positive            Yes      No
               d4      very negative                     No       Yes
               d5      negative                          No       Yes
Test Set       d6      Negative positive very positive   ?        ?
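
A possible worked solution is sketched below. It assumes a multinomial Naïve Bayes model with add-one (Laplace) smoothing and case-insensitive terms; the question does not specify the variant, so these choices and the helper names `train_nb` and `log_score` are assumptions.

```python
import math
from collections import Counter

train = [
    ("positive", "c1"),
    ("very positive", "c1"),
    ("positive very positive", "c1"),
    ("very negative", "c2"),
    ("negative", "c2"),
]
test_doc = "negative positive very positive"  # d6


def train_nb(examples):
    """Collect per-class document counts, per-class term counts and the vocabulary."""
    class_docs = Counter(label for _, label in examples)
    term_counts = {label: Counter() for label in class_docs}
    for text, label in examples:
        term_counts[label].update(text.lower().split())
    vocab = {t for counts in term_counts.values() for t in counts}
    return class_docs, term_counts, vocab


def log_score(text, label, class_docs, term_counts, vocab):
    """log P(c) + sum of log P(t|c), with add-one smoothing."""
    total_docs = sum(class_docs.values())
    total_terms = sum(term_counts[label].values())
    score = math.log(class_docs[label] / total_docs)
    for term in text.lower().split():
        if term in vocab:  # ignore terms never seen in training
            score += math.log(
                (term_counts[label][term] + 1) / (total_terms + len(vocab))
            )
    return score


model = train_nb(train)
scores = {c: log_score(test_doc, c, *model) for c in model[0]}
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

Under these assumptions the priors are P(c1) = 3/5 and P(c2) = 2/5, the vocabulary has 3 terms, and d6 is assigned to the class with the higher score.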

 

C) A search engine returned a ranked list of results whose relevance grades, on a 0-5 scale, are: 2, 2, 3, 0, 5. Calculate the NDCG metric for this ranking. [5 Marks]
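
The calculation can be sketched as follows. It assumes the DCG variant that discounts the grade at rank i by log2(i + 1); other textbooks use a slightly different discount, so the numeric answer depends on which formula the course expects. The function names `dcg` and `ndcg` are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


ranking = [2, 2, 3, 0, 5]
print(round(dcg(ranking), 4), round(ndcg(ranking), 4))
```

The ideal ordering here is 5, 3, 2, 2, 0; the score is below 1 mainly because the most relevant result is ranked last.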

                                                                                                                      

 





---------------------------------------------------------------------------- 
All the messages below are just forwarded messages. If anyone feels hurt by them, please add a comment and we will remove the post. The host/author is not responsible for these posts.

Friday, December 30, 2022

Tokenization Issues - Information Retrieval

Some of the tokenization issues are listed below.

1. One word or two words?
2. Numbers
3. No whitespace (e.g. Chinese)
4. Ambiguous segmentation (the same character sequence can be split into different words, e.g. in Chinese)
5. Bidirectional text (e.g. Arabic)
6. Accents and diacritics
7. Case folding
8. Stop words
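
A few of the issues above (accents and diacritics, case folding, stop words) can be illustrated with a small normalization sketch. The stop-word list and the function names `strip_accents` and `normalize_tokens` are hypothetical, and the simple `\w+` tokenizer does not address segmentation for languages written without whitespace.

```python
import re
import unicodedata

STOP_WORDS = {"the", "of", "a", "an", "and"}  # tiny illustrative list


def strip_accents(token):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_tokens(text):
    """Tokenize on word characters, case-fold, strip accents, drop stop words."""
    tokens = re.findall(r"\w+", text)
    folded = [strip_accents(t.casefold()) for t in tokens]
    return [t for t in folded if t not in STOP_WORDS]


print(normalize_tokens("The Résumé of a naïve Café"))
```

Whether to apply each step is a design decision: stripping accents and folding case improves recall but can merge words that differ in meaning.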


---------------------------------------------------------------------------- 

Merge Algorithm - Intersecting two posting lists - Information Retrieval



---------------------------------------------------------------------------- 

Wednesday, December 28, 2022

Inverted index construction - Information Retrieval














---------------------------------------------------------------------------- 

Tuesday, December 27, 2022

Evaluation Measures - Information Retrieval








---------------------------------------------------------------------------- 

Functional View of Paradigm IR System - Information Retrieval




---------------------------------------------------------------------------- 

The Process of Retrieving Information -- Information Retrieval







---------------------------------------------------------------------------- 

Data Retrieval vs Information Retrieval....


1. Matching

In data retrieval we normally look for an exact match; that is, we check whether an item is or is not present in the file.
Ex: SELECT * FROM Student WHERE per >= 8.0

In information retrieval we more generally want to find the items that partially match the request and then select a few of the best-matching ones from them.
Ex: "Students having a CGPA of 8 or more"

2. Inference

In data retrieval, inference is of the simple deductive kind; that is, if aRb and bRc, then aRc.
In information retrieval, inference is inductive: relations are only specified with a degree of certainty or uncertainty, and hence our confidence in the inference is variable.

3. Model

Data retrieval is deterministic, but information retrieval is probabilistic.
Bayes' Theorem is frequently invoked to carry out inferences in IR, whereas in DR probabilities do not enter into the processing.

4. Classification

In DR, monothetic classification is most likely to be used.
That is, classes are defined by objects possessing attributes that are both necessary and sufficient to belong to a class.

In IR, polythetic classification is mostly used.
Each individual in a class possesses only a proportion of all the attributes possessed by all the members of that class.

5. Query language

The query language for DR is one with a restricted syntax and vocabulary.
In IR we prefer to use natural language, although there are some notable exceptions.

6. Query specification

In DR the query is generally a complete specification of what is wanted.
In IR it is invariably incomplete.

7. Items wanted

In IR we are searching for relevant documents, as opposed to exactly matching items in DR.

8. Error response

DR is more sensitive to error, in the sense that an error in matching will not retrieve the wanted item, which implies a total failure of the system.
In IR, small errors in matching generally do not affect the performance of the system significantly.
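
The contrast drawn in point 1 (exact match versus partial, ranked match) can be sketched as below. The records, documents, and function names `exact_match` and `ranked_match` are invented for illustration, and the overlap score is only a stand-in for a real IR ranking function.

```python
def exact_match(records, min_cgpa):
    """Data retrieval: a record either satisfies the predicate or it does not."""
    return [r["name"] for r in records if r["cgpa"] >= min_cgpa]


def ranked_match(docs, query, k=2):
    """Information retrieval: score partial matches and return the best k."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        # Fraction of query terms that appear in the document.
        overlap = len(query_terms & set(text.lower().split())) / len(query_terms)
        if overlap > 0:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by doc ID
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical data for illustration only.
records = [
    {"name": "Asha", "cgpa": 8.4},
    {"name": "Ravi", "cgpa": 7.9},
    {"name": "Meena", "cgpa": 8.0},
]
docs = {
    "d1": "students with high cgpa",
    "d2": "cgpa of 8 or above",
    "d3": "hostel fee structure",
}
print(exact_match(records, 8.0))
print(ranked_match(docs, "student cgpa above 8"))
```

Note how "student" fails to match "students" in d1: small mismatches lower a document's score rather than excluding it, which is the tolerance to error described in point 8.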




----------------------------------------------------------------------------

Thursday, September 29, 2022

BITS-WILP-DSECLZG555 - Data Visualization and Interpretation - DVI - Final Question paper - 25092022













---------------------------------------------------------------------------- 

BITS-WILP-DSECLZG565 - Machine Learning - ML - Final Question paper - 25092022











---------------------------------------------------------------------------- 

BITS-WILP-DSECLZG523 - Introduction to Data Science - IDS - Final Question paper - 18092022













----------------------------------------------------------------------------

BITS-WILP-DSECLZC413- Introduction to Statistical Methods - ISM - Final Question paper - 18092022









---------------------------------------------------------------------------- 

Saturday, September 24, 2022

DSECLZG565- MACHINE LEARNING - Quick Calculators







---------------------------------------------------------------------------- 

DSECLZG555-DATA VISUALIZATION AND INTERPRETATION - Story Telling Strategies

Establishing Context

Who: the audience and you

What: action, mechanism and tone

How: data

Story Telling Strategies
------------------------

1. 3-minute story: tell the story within 3 minutes by telling the audience only what they need to know, with no dependency on materials, visualizations, etc. The storyteller needs to know exactly what the data is saying.

2. Big idea: boil the message down to the single most important sentence. It should articulate a unique point of view, convey what's at stake, and be a complete sentence.

3. Storyboarding: establishes the structure of the communication as a visual outline of the content. Use a whiteboard, Post-it notes, etc.


----------------------------------------------------------------------------


DSECLZG555 - DATA VISUALIZATION AND INTERPRETATION - Gestalt Principles of Visual Perception

  1. Law of Prägnanz (Simplicity)
  2. Law of Similarity
  3. Law of Continuity
  4. Law of Focal Point
  5. Law of Proximity
  6. Law of Figure/Ground
  7. Principle of Enclosure
  8. Principle of Closure
  9. Principle of Continuity
  10. Principle of Connection
  11. Principle of Proximity
  12. Principle of Similarity

---------------------------------------------------------------------------- 

DSECLZG555 - DATA VISUALIZATION AND INTERPRETATION - Mistakes in Dashboard design

Mistakes in dashboard

1. Design issues
  a. Exceeding the screen
  b. Meaningless variety
  c. Cluttered display
  d. Unattractive visuals

2. Data issues
  a. Inadequate context for the data
  b. Using a deficient measure
  c. Incorrect data encoding
  d. Poor data arrangement
  e. Ineffective data highlighting

3. Display issues
  a. Inappropriate display media
  b. Poorly designed display media

13 Design Mistakes

1. Exceeding the Boundaries of a Single Screen
2. Supplying Inadequate Context for the Data
3. Displaying Excessive Detail or Precision
4. Choosing a Deficient Measure
5. Choosing Inappropriate Display Media
6. Introducing Meaningless Variety
7. Using Poorly Designed Display Media
8. Encoding Quantitative Data Inaccurately
9. Arranging the Data Poorly
10. Highlighting Important Data Ineffectively
11. Cluttering the Display with Useless Decoration
12. Misusing or Overusing Color
13. Designing an Unattractive Visual Display



---------------------------------------------------------------------------- 