Discover what the text is about

Topic Discovery 

Wouldn't it be so cool if you could discover what a book, post, tweet, magazine, etc. is talking about, and which topics it discusses, without judging it by its cover? Something like skimming the text, except it's not you who grasps what it's about. It's your computer 😮. OMG, these computers are doing cool stuff these days.

The trendy term of the day, "Machine Learning", stands behind every piece of tech where a computer sees, reads or predicts, anything that comes naturally to us as humans. Machines can learn it (but we are still smarter, since it's us who design the algorithms for them. What if they learn how to design their algorithms themselves? LOL!). Topic Modeling means representing text as a model (simple, huh!) in order to do text mining and extract the underlying themes in the texts. The term is well known in the Machine Learning and Natural Language Processing fields. More info [here]

Enough introductory theory, let's get it working and see how cool it is:
We will use an algorithm called Latent Dirichlet Allocation (LDA) to generate the topics based on word frequencies. More info [here]

Requirements:

  1. [Python]: I ran the example with Python 2.7
  2. [NLTK]: Natural Language Processing package for Python
  3. [STOP_WORDS]: a Python package with a collection of words that carry no topical meaning (stop words) for this kind of operation
  4. [Gensim]: Topic modeling package containing the LDA algorithm

You can fast-forward and try the whole thing [here]

Steps:

  1. Structure our project
    text_tutorial
        ├── basic.py
        ├── docs.py
        └── utils.py
    
    
  2. The input text: I wrote some sentences in the docs.py file, assuming we have more than one document and want to process them all. Since it's only a tutorial, I will represent every document as a string of sentences. Here, for example, we are talking about [Falafel] food:

    
    doc_a = "Falafel is good to eat. My brother likes to eat Falafel sandwiches, but not my mother."
    doc_b = "My mother spends a lot of time feeding my brother healthy food with no oil."
    doc_c = "Some health experts suggest that falafel may cause increased tension and blood pressure."
    doc_d = "I often feel pressure to perform well at football, but my mother never seems to feed my brother to do better."
    doc_e = "Health professionals say that falafel is not good for your health. nah !"
    
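    Since the rest of the pipeline works on all documents at once, docs.py can also gather them into a single list (the name documents below is my own choice, nothing requires it):

    documents = [doc_a, doc_b, doc_c, doc_d, doc_e]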
    

  3. Text pre-processing: we have to parse the data before modeling it, so that the text is understandable by the algorithm that builds the model. In the utils.Preprocessor class in the utils.py module we have:
    1. Tokenizing: we split every sentence into single tokens (a token could be a word, a number or a character). For the simplicity of this example we split on word boundaries (spaces, semicolons, dots), which produces a list of words.
    2. Stop-word removal: words like (a, the, to, ...) are unnecessary for building the model and may cause confusion, so we remove them and produce a list without them.
    3. Stemming: like, likes and liked all relate to the same origin verb 'like', so we reduce every word to its stem.

    
    from nltk.tokenize import RegexpTokenizer
    from nltk.stem.porter import PorterStemmer
    from stop_words import get_stop_words
    
    class Preprocessor:
    
        def __init__(self):
            # split text on any non-word character (spaces, punctuation, ...)
            self.tokenizer = RegexpTokenizer(r'\w+')
            # English stop-word list (a, the, to, ...)
            self.stopper = get_stop_words('en')
            # reduces words to their stem: likes/liked -> like
            self.stemmer = PorterStemmer()
    
        def pre_process_doc(self, doc=None):
            if not doc:
                raise AttributeError("No input")
    
            # 1. tokenize, 2. drop stop words, 3. stem what is left
            tokens = self.tokenizer.tokenize(doc)
            stopped_tokens = [i for i in tokens if i not in self.stopper]
            stemmed = [self.stemmer.stem(i) for i in stopped_tokens]
            return stemmed
        
        def pre_process_docs(self, docs=None):
            if not docs:
                raise AttributeError("No input")
    
            return [self.pre_process_doc(d) for d in docs] 
    
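    As a quick sanity check, you can run the preprocessor on one of the documents (a small illustration; it assumes docs.py exposes the documents list from step 2):

    from docs import documents
    from utils import Preprocessor

    pre = Preprocessor()
    print pre.pre_process_doc(documents[0])
    # prints the stemmed, stop-word-free tokens of doc_a,
    # e.g. 'likes' becomes 'like' and words such as 'is', 'to', 'but' disappear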

  4. Constructing our LDA model:
    1. Create a dictionary that ties every word to a unique ID (here text is the list of preprocessed token lists returned by pre_process_docs)

      
      from gensim import corpora

      dictionary = corpora.Dictionary(text)
      
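      To see the mapping the dictionary builds, you can print its token2id attribute (the exact IDs are arbitrary, they depend on the order Gensim assigns them):

      print dictionary.token2id
      # e.g. {u'falafel': 0, u'good': 1, u'eat': 2, ...}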


    2. Create a corpus representation of the whole text using a Bag of Words. The corpus is a vector of vectors (number of vectors = number of documents entered). Each inner vector contains tuples, and every tuple is (term/word ID, frequency).
      
      corpus = [dictionary.doc2bow(t) for t in text]
      
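      Likewise, printing the first corpus entry shows the (word ID, frequency) pairs for the first document (the numbers below are only illustrative):

      print corpus[0]
      # e.g. [(0, 2), (1, 1), (2, 2), ...] means word 0 occurs twice in doc_a, word 1 once, and so on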

    3. Generate the LDA model, passing the number of topics you want it to find; the algorithm iterates over the corpus (the passes parameter) until the topic assignments converge to a good solution.

      
      from gensim.models.ldamodel import LdaModel

      ldamdl = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=40)
      
      print ldamdl.print_topics(num_topics=2, num_words=2) 

  5. The output (in my falafel case) is 2 topics, each represented by its 2 most probable words together with their weights, i.e. how strongly each word characterizes the topic discussed in the documents:

    
    [(0, u'0.081*"health" + 0.058*"pressur"'), (1, u'0.081*"good" + 0.078*"falafel"')] 
    
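Putting the pieces together, basic.py could look roughly like this (a minimal sketch; the documents list and the imports from docs.py and utils.py are my own assumptions about how the files above are wired together):

    from gensim import corpora
    from gensim.models.ldamodel import LdaModel

    from docs import documents          # the list of raw document strings from step 2
    from utils import Preprocessor      # the class from step 3

    # tokenize, remove stop words and stem every document
    text = Preprocessor().pre_process_docs(documents)

    # map every word to an ID and build the bag-of-words corpus
    dictionary = corpora.Dictionary(text)
    corpus = [dictionary.doc2bow(t) for t in text]

    # train the LDA model and print the two discovered topics
    ldamdl = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=40)
    print ldamdl.print_topics(num_topics=2, num_words=2)
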
Source code [here]
