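Some common tokenization issues are listed below.
1. One word or two words? (e.g. "data base" vs. "database")
2. Numbers (e.g. dates, decimals, phone numbers)
3. No whitespace between words (e.g. Chinese)
4. Ambiguous segmentation (the same character sequence can be split into words in more than one way, e.g. in Chinese)
5. Bidirectional text (e.g. Arabic)
6. Accents and diacritics
7. Case folding
8. Stop words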
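
A few of the issues listed above (numbers, accents and diacritics, case folding, stop words) can be illustrated with a minimal Python sketch. This is only one possible approach, not a standard implementation: the function names, the regular expression, and the tiny stop-word list are assumptions made up for this example, and it does not attempt to handle languages without whitespace or bidirectional text.

```python
import re
import unicodedata

# Hypothetical, tiny stop-word list used only for illustration.
STOP_WORDS = {"the", "a", "an", "of", "is", "in"}


def strip_accents(text: str) -> str:
    """Remove combining marks after NFD decomposition (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list[str]:
    """Case-fold, strip accents, split into tokens, and drop stop words."""
    text = strip_accents(text.casefold())
    # Keep numbers with internal separators (3.14, 1,000) as one token;
    # otherwise take runs of word characters.
    tokens = re.findall(r"\d+(?:[.,]\d+)*|\w+", text)
    return [t for t in tokens if t not in STOP_WORDS]


print(tokenize("The Café is in the résumé of 1,000 items"))
# → ['cafe', 'resume', '1,000', 'items']
```

Whether to fold case, strip accents, or drop stop words depends on the application; each step loses information (for example, "résumé" and "resume" become the same token), so this sketch applies all of them only to show the effect.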