Friday, December 30, 2022

Tokenization Issues - Information Retrieval

Some common tokenization issues in information retrieval are listed below:

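1. One word or two words? (for example, "database" vs. "data base")
2. Numbers (for example, dates, phone numbers, and decimals)
3. No whitespace between words (for example, Chinese)
4. Ambiguous segmentation (the same character sequence can be split into words in more than one way, for example in Chinese)
5. Bidirectional text (for example, Arabic)
6. Accents and diacritics
7. Case folding
8. Stop words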
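
A few of the issues above (numbers, accents and diacritics, case folding, and stop words) can be illustrated with a small Python sketch. This is only one possible approach, not a standard implementation: the function names `strip_accents` and `tokenize`, the regular expression, and the tiny `STOP_WORDS` set are all assumptions made for illustration. It also assumes whitespace-delimited text, so it does not address languages such as Chinese, which need a dedicated word segmenter.

```python
import re
import unicodedata

# Illustrative stop-word list only; real systems use much larger lists
# or keep stop words entirely.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def strip_accents(text):
    """Remove accents and diacritics by dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def tokenize(text, remove_stop_words=True):
    """Case-fold, strip accents, split into tokens, optionally drop stop words."""
    text = strip_accents(text.casefold())
    # First alternative keeps numbers such as "1,250" or "3.14" together;
    # second alternative keeps hyphenated words such as "state-of-the-art" together.
    tokens = re.findall(r"\d+(?:[.,]\d+)+|\w+(?:-\w+)*", text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


example = tokenize("The Café in São Paulo sold 1,250 state-of-the-art résumés")
print(example)
# ['cafe', 'sao', 'paulo', 'sold', '1,250', 'state-of-the-art', 'resumes']
```

Each choice here is a trade-off. Keeping "state-of-the-art" as one token answers the "one word or two" question one way, and stripping accents makes "résumés" and "resumes" match at the cost of merging words that differ only by a diacritic.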


