Text Analytics: Topic Analysis with Python

Ever notice how the world is turning more semantic? Instead of directing a mouse or even a finger to instruct a piece of technology to do something, we are now able to simply instruct with our voices. Devices such as Alexa or Google Home are becoming increasingly integrated with other technologies, creating new opportunities for semantics to dictate how those technologies operate. But how does a machine, like a computer or an Alexa device, know what I am saying? The short answer is through text analytics.

Even the business world is becoming increasingly aware of the power of information “hidden” in the hundreds, thousands, millions, or even billions of text-based data points. In fact, one of the most common requests I get from business users goes something like this:

User: I know that staff members take note of those interactions with customers, but I don’t have any visibility over what they are putting in those notes. I want to know what they do. Maybe what they are taking note of has important implications for my business process.

In this blog post we will explore some fundamental processes involved in text analytics that will help us answer business requests like the one described above. The goals of this post are to:

• Develop code to pre-process text data
• Perform topic analysis by groups
• Organize our results into a meaningful presentation
• Full code found at the bottom

Just a few additional notes on the code below. It is often the case that we want to discover topics that are hidden in text in more than one context. For example, maybe you want to compare what males are talking about with what females are talking about, or any other grouping for that matter. Thus, the example below explores topic analysis of text data by groups. This also differentiates this post from other excellent posts on the more general subject of topic analysis.

Before starting, it is important to note just a few things regarding the environment we are working and coding in:

• Python 3.6 Running on a Linux machine
• NLTK 3.2.5
• Pandas 0.22.0
• Numpy 1.14.0
• Sklearn 0.19.1

Preprocessing Text Data

To begin, we must start with some text data. Because we are working from a business use case we will assume that the data are contained in some type of relational table (i.e. an RDBMS) or csv format. Either way, the first step will be to get the text data loaded into memory using Python’s ever-powerful Pandas library. For this example I will read from a csv, but there is a lot of really good documentation for querying from RDBMS systems too (see pandas.read_sql()).

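import pandas as pd
import numpy as np
import nltk
from nltk.corpus import wordnet
from nltk import pos_tag, WordNetLemmatizer

# One-time downloads of the NLTK resources used below (uncomment on first run)
# nltk.download('punkt'); nltk.download('stopwords')
# nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')

# The trailing slash lets us build file names with path + file
path = '/path/to/csv/'
file = 'csvfiletoload.csv'

df = pd.read_csv(path + file, encoding='latin-1')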

In the code above, we first import necessary libraries. Then, in the second chunk, we declare a path variable that allows us to use the same path for reading and writing input/output. We also declare a file variable that will be the name of the csv file we want to bring into memory.

Finally, we define a Data Frame (df) that reads our csv file into memory as a Pandas Data Frame. But why the “encoding = ‘latin-1’” in the code? Different programs save text with different underlying encodings to preserve symbols that may not be available in every encoding scheme. Pandas assumes UTF-8 unless told otherwise, so a file saved in another encoding can raise an encoding error when loaded. For a list of the encodings that can be specified when loading text files in Pandas, see the standard encodings listed in the Python codecs documentation.
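To see why the encoding matters, here is a minimal sketch using only plain Python strings and bytes (no csv file involved). The sample text is made up for illustration; it stands in for a note that another program saved as latin-1.

```python
# Hypothetical sample text; the accented characters are what trip up a UTF-8 decoder
raw_bytes = "café résumé".encode("latin-1")

# Reading latin-1 bytes as if they were UTF-8 fails on the accented characters
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the text was saved in recovers the original string
decoded = raw_bytes.decode("latin-1")
print(utf8_ok, decoded)
```

The same thing happens inside pd.read_csv: naming the encoding the file was actually written in avoids the error.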

pd.options.mode.chained_assignment = None

I have added the code above as a way of dealing with writing over data frames as we clean our data. Anytime we process a data frame with some cleaning algorithm, we may not care about the old data and so will want to simply write over the old data frame. Although it is not always a best practice to write over old data with new data, it is often more efficient for memory, so I simply suggest using your own discretion. The setting above stops Pandas from printing its SettingWithCopyWarning each time we do this.

text_col = 'Your text column'
group_col = 'Group Column'
df_text = df[[group_col, text_col]]

# Remove common punctuation symbols outright
df_text[text_col] = df_text[text_col].replace(to_replace=r'[,?$.!\-:]', value=r'', regex=True)
# Replace anything that is not a letter or a space with a space
df_text[text_col] = df_text[text_col].replace(to_replace=r'[^a-zA-Z ]', value=r' ', regex=True)
# Collapse runs of whitespace into a single space
df_text[text_col] = df_text[text_col].replace(to_replace=r'\s\s+', value=r' ', regex=True)

In the next set of code we first identify the column that contains our text data. We then identify the column that contains our grouper, like gender. Once we have identified our two most important columns we create a new data frame of just those columns called df_text.

Finally, I have included 3 different sets of code for doing some initial processing of the text data using regular expressions. The first pattern replaces funny symbols with nothing in order to remove them from the analysis. You can add more symbols that may be unique to your data set by placing them inside the square brackets. The second pattern replaces everything that is not a letter or a space with a space. The last pattern replaces runs of extra blank spaces with a single space so that each word is separated from the next by exactly one space. These obviously have overlapping effects, so use one, all, or modify to your specific needs.
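As a quick check of what the three cleaning steps do, here is a small sketch that applies them to one made-up note using Python's standard re module instead of Pandas. The clean_text helper, the sample string, and the final strip() are my own additions for illustration.

```python
import re

def clean_text(text):
    # Step 1: remove common punctuation symbols outright
    text = re.sub(r"[,?$.!\-:]", "", text)
    # Step 2: replace anything that is not a letter or a space with a space
    text = re.sub(r"[^a-zA-Z ]", " ", text)
    # Step 3: collapse runs of whitespace into a single space
    text = re.sub(r"\s\s+", " ", text)
    # Trim leading and trailing spaces left over from the steps above
    return text.strip()

sample = "Called cust. re: invoice #42 -- not happy!!"
cleaned = clean_text(sample)
print(cleaned)  # Called cust re invoice not happy
```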

wnl = WordNetLemmatizer()
stop = set(nltk.corpus.stopwords.words('english'))
# Words we want to keep even though they may appear in the stopword list
operators = set(['not', 'n/a', 'na'])
stopwords = stop - operators

def remove_stopwords(tokens, stopwords):
	return [token for token in tokens if token not in stopwords]

def get_wordnet_pos(treebank_tag):
	# Map a Penn Treebank tag to the matching WordNet part of speech
	if treebank_tag.startswith('J'):
		return wordnet.ADJ
	if treebank_tag.startswith('V'):
		return wordnet.VERB
	if treebank_tag.startswith('N'):
		return wordnet.NOUN
	if treebank_tag.startswith('R'):
		return wordnet.ADV
	# Anything else is treated as a noun
	return wordnet.NOUN

def lemmarati(tup_list):
	# Leave missing values untouched
	if not (np.all(pd.notnull(tup_list))):
		return tup_list
	outputlist = []
	# Each tuple is (word, treebank_tag) as returned by pos_tag
	for i, j in tup_list:
		pos = get_wordnet_pos(j)
		lemma = wnl.lemmatize(i, pos)
		outputlist.append(lemma)
	return outputlist

In the next set of code, we set up some functions that will allow us to do some more cleaning and normalizing of the text data. More specifically, the code sets up a function to remove stopwords, or words that are very common and as a result not all that meaningful (e.g. the). The remaining code performs lemmatization. Lemmatization is a way of normalizing text so that verb forms like run, runs, and ran all become just run. Lemmatization is like stemming, but it takes the part of speech into account so that meet (v) and meeting (n) are kept separate.

Also, note that before defining our stopword list we remove some words that we want to keep in our topic analysis. Words like ‘not’, although often considered stopwords, can be very important when performing topic or sentiment analysis. Consider the difference between ‘happy’ and ‘not happy.’ The latter is the opposite of the former. However, if we used the nltk stopwords list as is, we would remove ‘not’ from the text and run the risk of thinking most comments were ‘happy’ when in reality they were ‘not happy.’
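The effect of keeping ‘not’ is easy to see on a toy example. The sketch below uses a tiny made-up stopword set in place of NLTK's English list, so it runs without any downloads; only the set subtraction and the filtering function mirror the code in this post.

```python
# A tiny hypothetical stopword list standing in for NLTK's English list
stop = {"the", "is", "a", "not", "was"}
operators = {"not", "n/a", "na"}

# Set difference: every stopword except the ones we want to keep
stopwords = stop - operators

def remove_stopwords(tokens, stopwords):
    return [token for token in tokens if token not in stopwords]

tokens = ["the", "customer", "was", "not", "happy"]
with_full_list = remove_stopwords(tokens, stop)           # 'not' is dropped
with_kept_negation = remove_stopwords(tokens, stopwords)  # 'not' survives

print(with_full_list)      # ['customer', 'happy']
print(with_kept_negation)  # ['customer', 'not', 'happy']
```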

I wanted to explain the code before invoking the functions, which we do below.

df_text[text_col] = df_text[text_col].map(lambda x: nltk.word_tokenize(x.lower()) if (np.all(pd.notnull(x))) else x)

df_text[text_col] = df_text[text_col].map(lambda x: pos_tag(x) if (np.all(pd.notnull(x))) else x)

df_text[text_col] = df_text[text_col].map(lemmarati)

df_text[text_col] = df_text[text_col].map(lambda x: remove_stopwords(x,stopwords) if (np.all(pd.notnull(x))) else x)

In the code above we invoke the functions we created in the previous code block. The first line tokenizes our text strings (identifies the individual words), creating lists of word tokens. The next two lines perform part-of-speech tagging (pos_tag) and then return the lemma for each word (lemmarati [I must have been inspired by the illuminati when I wrote this function 😉]). The last line removes the stopwords.

df_text[text_col] = df_text[text_col].map(lambda x: ' '.join(x) if (np.all(pd.notnull(x))) else x)

I include just one line of code because it is the last thing we will do before we move on to the topic analysis. When using scikit-learn to perform topic analysis we need to make sure we are submitting a string and not a list of word tokens. Thus, the above code gets the text back into sentence form, if you will. And now we are ready for topic analysis.

Topic Analysis using NMF (or LDA)

In the next section we perform Non-Negative Matrix Factorization (NMF), which can be thought of as similar to factor analysis for my behavioral science audience. Essentially, you first create a term document frequency matrix and then look for those terms that tend to show up together in documents at a higher frequency than other terms thus creating topics. This is a very high-level way of explaining the set of rather complicated algorithms ‘under the covers’ of this analysis but it should give you a good sense of how to interpret the results.
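For readers who want to peek under the covers, here is a rough sketch of the idea in plain Python. It factors a small non-negative matrix V (documents by terms) into W (documents by topics) and H (topics by terms) using the classic multiplicative update rules. This is my own illustration and is not how scikit-learn implements NMF; the function names, the toy matrix, and the iteration count are all assumptions.

```python
import random

def matmul(a, b):
    # Plain-Python matrix product of two lists of lists
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    # Squared reconstruction error between V and W x H
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, n_topics, n_iter=200, seed=1, eps=1e-9):
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    # Start from small positive random factors
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # Update H: H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # Update W: W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h

# Toy matrix: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3
v = [[1, 1, 0, 0], [2, 2, 0, 0], [0, 0, 1, 1], [0, 0, 3, 3]]
w0, h0 = nmf(v, n_topics=2, n_iter=0)   # the random starting point
w, h = nmf(v, n_topics=2, n_iter=200)   # after fitting
print(frobenius_error(v, w0, h0), frobenius_error(v, w, h))
```

Each row of H plays the role of a topic: the terms with the largest weights in a row are the ones that tend to show up together.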

Another common and quite popular algorithm for topic analysis is Latent Dirichlet Allocation (LDA). Although LDA is quite popular, I often find NMF to produce better results with smaller data sets. Thus, we focus on NMF below, but note that the scikit-learn documentation has a great example comparing both methods, where you will notice that the syntax for NMF is nearly identical to that of LDA.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

n_features = 1000
n_topics= 10

tfidf_vec = TfidfVectorizer(max_df = .95, min_df = 2, max_features = n_features, ngram_range = (2,3))

In the above code we start by importing the necessary libraries for topic analysis. In the next chunk we declare variables to limit our results to 1000 features and 10 topics. These settings can be adjusted to your preference. In the final set of code we set up the TfidfVectorizer with some important parameters. First, you may be asking what is Tfidf? This refers to Term-Frequency-Inverse-Document-Frequency. Think of Tfidf as a weight for each word that represents how important the word is for each document in the corpus of documents. It is a slightly fancier way of creating a term frequency matrix. From Wikipedia: “The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.”
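To make the weighting concrete, here is a small sketch that computes a textbook version of tf-idf by hand: term frequency within the document multiplied by log(number of documents / number of documents containing the term). Note that this is the plain textbook formula; scikit-learn's TfidfVectorizer applies smoothing and normalization on top, so its numbers will differ. The toy documents and the tfidf function are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    n_docs = len(docs)
    # Number of documents each term appears in
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["late", "payment", "fee"],
    ["late", "delivery"],
    ["late", "payment", "plan"],
]
weights = tfidf(docs)
print(weights[0])
```

In this toy corpus ‘late’ appears in every document, so its weight is zero, while ‘fee’ appears in only one document and gets the largest weight in the first document.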

Second, as you examine the parameters in the code I want to emphasize the ‘ngram_range = (2,3)’ part. This tells the computer to create a tf-idf matrix of both 2-word (bigram) and 3-word (trigram) sequences. In other words, we are asking the matrix to capture phrases, which can be quite meaningful for topic analysis.
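If bigrams and trigrams are new to you, the sketch below shows what they look like for a single tokenized comment. The word_ngrams function is a hypothetical helper written for this illustration; the vectorizer builds these phrases for you internally.

```python
def word_ngrams(tokens, min_n=2, max_n=3):
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words across the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

phrases = word_ngrams(["customer", "not", "happy", "delivery"])
print(phrases)
# ['customer not', 'not happy', 'happy delivery',
#  'customer not happy', 'not happy delivery']
```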

Now, this next set of code performs the NMF analysis by group and saves the results as a list of data frames containing the top 5 bi and tri-grams for each of the 10 topics for each group. It is probably the least pythonic but lends itself to easy reporting to end users.

groups = df_text[group_col].unique()
results = []

for i in groups:
	# Keep this group's rows and drop missing text, which the vectorizer cannot handle
	df_grp = df_text.loc[df_text[group_col] == i].dropna(subset=[text_col])
	if len(df_grp[text_col]) > 100:
		try:
			tf = tfidf_vec.fit_transform(df_grp[text_col])
			# get_feature_names() and alpha match Sklearn 0.19.1; newer releases changed both
			feature_names = tfidf_vec.get_feature_names()
			nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tf)
			df_topics = pd.DataFrame(nmf.components_)
			df_topics.columns = feature_names
			df_top = df_topics.apply(lambda x: pd.Series(x.sort_values(ascending=False).iloc[:5].index, index=['top1','top2','top3','top4','top5']), axis=1).reset_index()
			df_top['Group'] = i
			results.append(df_top)
		except Exception:
			# Report the failure; appending a string to results would break pd.concat later
			print(str(i) + ' Did not produce topic results')

In the code above, we first get a list of the unique groups in our grouping column. We then create a container (in this case a list) to hold our resulting data frames from the NMF topic analysis.

In the for loop, we perform a separate NMF analysis for each unique group contained in the grouping column. We use the ‘if len(df_grp[text_col]) > 100’ logic to ensure we have enough rows of text for the analysis. We use the ‘try:’ statement to ensure that the analysis will still run in case one of the groups gives us an error. In the ‘try:’ code we perform the NMF, extract the components into a data frame, label the data frame with the feature names (the bi and trigrams), select only the top 5 bi and trigrams for each topic based on their numeric contribution to the topic, add a column to the data frame to keep track of which group the topics are for, and append the results to our results list.
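The “top 5” step is the densest line in the loop, so here is the same idea in plain Python for a single topic. The feature names and weights are made up, and top_terms is a hypothetical helper; in the real code each row of nmf.components_ supplies the weights.

```python
def top_terms(topic_weights, feature_names, k=5):
    # Pair each phrase with its weight and sort from largest to smallest weight
    ranked = sorted(zip(feature_names, topic_weights), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]

feature_names = ["late fee", "not happy", "payment plan", "call back", "new address", "late payment"]
topic_weights = [0.9, 0.1, 0.4, 0.0, 0.2, 0.7]

top3 = top_terms(topic_weights, feature_names, k=3)
print(top3)  # ['late fee', 'late payment', 'payment plan']
```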

Now we have a list of data frames, which is not very useful as a list, so there is one more step before we finish.

topic_results = pd.concat(results, axis=0)
topic_results.to_csv(path + 'my_NMF_results.csv')

Now we are done. The ‘my_NMF_results.csv’ file now contains a nicely organized table of 10 topics by group showing the top 5 bi and trigrams that can help you to understand the business meaning of each topic.

Stay tuned for future posts where we will use the results of our topic analysis to score new text by topic, perform sentiment analysis and topic classification, and apply other analytics that will help us meet the challenges of dealing with text data.

Feel free to add comments or questions. Be kind and respectful as unkind or disrespectful posts will be removed.

Full Code Set

import pandas as pd
import numpy as np
import nltk
from nltk.corpus import wordnet
from nltk import pos_tag, WordNetLemmatizer

# One-time downloads of the NLTK resources used below (uncomment on first run)
# nltk.download('punkt'); nltk.download('stopwords')
# nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')

# The trailing slash lets us build file names with path + file
path = '/path/to/csv/'
file = 'csvfiletoload.csv'

df = pd.read_csv(path + file, encoding='latin-1')
pd.options.mode.chained_assignment = None
text_col = 'Your text column'
group_col = 'Group Column'
df_text = df[[group_col, text_col]]

# Remove common punctuation symbols outright
df_text[text_col] = df_text[text_col].replace(to_replace=r'[,?$.!\-:]', value=r'', regex=True)
# Replace anything that is not a letter or a space with a space
df_text[text_col] = df_text[text_col].replace(to_replace=r'[^a-zA-Z ]', value=r' ', regex=True)
# Collapse runs of whitespace into a single space
df_text[text_col] = df_text[text_col].replace(to_replace=r'\s\s+', value=r' ', regex=True)

wnl = WordNetLemmatizer()
stop = set(nltk.corpus.stopwords.words('english'))
# Words we want to keep even though they may appear in the stopword list
operators = set(['not', 'n/a', 'na'])
stopwords = stop - operators

def remove_stopwords(tokens, stopwords):
	return [token for token in tokens if token not in stopwords]

def get_wordnet_pos(treebank_tag):
	# Map a Penn Treebank tag to the matching WordNet part of speech
	if treebank_tag.startswith('J'):
		return wordnet.ADJ
	if treebank_tag.startswith('V'):
		return wordnet.VERB
	if treebank_tag.startswith('N'):
		return wordnet.NOUN
	if treebank_tag.startswith('R'):
		return wordnet.ADV
	# Anything else is treated as a noun
	return wordnet.NOUN

def lemmarati(tup_list):
	# Leave missing values untouched
	if not (np.all(pd.notnull(tup_list))):
		return tup_list
	outputlist = []
	# Each tuple is (word, treebank_tag) as returned by pos_tag
	for i, j in tup_list:
		pos = get_wordnet_pos(j)
		lemma = wnl.lemmatize(i, pos)
		outputlist.append(lemma)
	return outputlist

df_text[text_col] = df_text[text_col].map(lambda x: nltk.word_tokenize(x.lower()) if (np.all(pd.notnull(x))) else x)

df_text[text_col] = df_text[text_col].map(lambda x: pos_tag(x) if (np.all(pd.notnull(x))) else x)

df_text[text_col] = df_text[text_col].map(lemmarati)

df_text[text_col] = df_text[text_col].map(lambda x: remove_stopwords(x,stopwords) if (np.all(pd.notnull(x))) else x)
df_text[text_col] = df_text[text_col].map(lambda x: ' '.join(x) if (np.all(pd.notnull(x))) else x)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

n_features = 1000
n_topics = 10

tfidf_vec = TfidfVectorizer(max_df = .95, min_df = 2, max_features = n_features, ngram_range = (2,3))

groups = df_text[group_col].unique()
results = []

for i in groups:
	# Keep this group's rows and drop missing text, which the vectorizer cannot handle
	df_grp = df_text.loc[df_text[group_col] == i].dropna(subset=[text_col])
	if len(df_grp[text_col]) > 100:
		try:
			tf = tfidf_vec.fit_transform(df_grp[text_col])
			# get_feature_names() and alpha match Sklearn 0.19.1; newer releases changed both
			feature_names = tfidf_vec.get_feature_names()
			nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tf)
			df_topics = pd.DataFrame(nmf.components_)
			df_topics.columns = feature_names
			df_top = df_topics.apply(lambda x: pd.Series(x.sort_values(ascending=False).iloc[:5].index, index=['top1','top2','top3','top4','top5']), axis=1).reset_index()
			df_top['Group'] = i
			results.append(df_top)
		except Exception:
			# Report the failure; appending a string to results would break pd.concat later
			print(str(i) + ' Did not produce topic results')

topic_results = pd.concat(results, axis=0)
topic_results.to_csv(path + 'my_NMF_results.csv')
