What startups need to know about data science!

If you are embarking on (or thinking about embarking on) building a startup product and are concerned about not having data science built into your solution, then this article is for you. Obviously, I am referring to products or services that do not have data science, artificial intelligence, or machine learning as their core function, which is most of you.

Being in the field of data science and having worked with local startups, I often get asked how to enable data science [or artificial intelligence (Ai) or machine learning, all relative synonyms for most people] for startup products or companies more generally. Unfortunately, the buzz has bitten the beast, so to speak, and in all too many cases people nearly stop production because they get too caught up in the confusing mess of technology and statistics required to enable any real, useful application of data science, let alone its computational partner Ai. I don't say this to discourage but rather to help focus efforts where they matter most. In this post, we will examine the role that data science should play as you build your startup solution. To that end, we will discuss why your product doesn't need data science yet, what to look for in your initial market testing for future data science opportunities, and how to begin to lay the groundwork for future data science integration.

Before going too much further, I want to pause briefly and quickly define what I mean by “data science.” Although the term is often obfuscated by associated technologies, when I refer to data science I am referring to the process of capturing data, transforming it so that it can be analyzed, using statistical models to find patterns in that data, and using those models to answer questions (e.g. make decisions). With this process we can answer complex questions that are challenging for humans to answer when the data are complex such as “Who is more at risk for a heart attack?” Or we can help machines to answer simple questions that are easy for humans but hard for machines such as “Is this a person or an animal?” But I digress, back to our discussion on data science and startups.

Why your startup doesn’t need data science…yet…

First, let's consider how most startups work. Most startup companies get started because the inventors/creators have identified a human problem that they can solve with their own unique, and often combined, experiences. What is very important to keep in mind here is that your solution, the one you developed without data science but with experience (okay, maybe a little data science or research, at least for the more rigorous of us), is solving a human problem without data science (*mind blown*). Seems obvious, but it is fundamental. Second, when you keep this core focus in mind, you see that data science, just like everyone else trying to sell your new startup something it doesn't need, is a distraction that saps your startup momentum. Moreover, this illusion is actively perpetuated by the giants who own solutions (touting the coolness of data science capabilities) in the fields you are looking to penetrate, which ultimately leaves you feeling as though you simply can't compete without data science.

All hyperbole aside, the core message is important to repeat; your product was created sans data science and so should be brought to market sans data science. But that doesn’t mean that you can’t prepare for the future…

What to look for in your initial market release…

Once your product is ready for an alpha release, it now becomes important to address the future opportunities that data science may help to bring to your product. But how do you prepare for data science when you are still struggling to figure out what it means? Remember that data science can help us, us being people (owners, users, customers, etc) or machines (apps, robots, phones, etc), to answer questions. What this means is that you need to be sensitive to the questions that both you and your core customer base have as they experience your product.

Case in point, a startup develops an application that allows people to keep track of their college friends in one centralized location. Fast forward 14 years and Facebook is now a tech giant making strong and notable contributions to basic data science but by no means started there. What Zuckerberg did recognize was that his users had questions and he sought to identify ways through which the data he collected could help his users answer those questions (“Has anyone posted a photo of me?”, “If I have to see advertisements, what ads are most relevant to me?”, “Can’t Facebook just automatically tag my friends?”, etc.).

The take home for this section is to listen to your users as you roll out your product. Focus groups, surveys, emails, or any opportunity to receive feedback is an opportunity to add context to the continued evolution of your product or service. Examine the questions and challenges they have and consider whether your solution can collect the necessary information to possibly answer the question. If you identify some information that your product naturally collects from customers that may answer their question, then bingo…you have a data science use case. Thus, data science should be use-case driven such that each data science solution is attached to clear business value.

Okay, I have got my use cases, what now?

Although getting into specifics surrounding how to establish a data science pipeline like the one I describe at the beginning is beyond the scope of this article, I will leave you with a few ideas to consider along with some resources for digging deeper. The key to answering any question using data science starts with data (like it is literally the beginning of the phrase…duh). This means that you need to identify opportunities and some simple technologies to capture data.

Possible capture mechanisms include:

  • Relational Databases – SQLite, MySQL, PostgreSQL
  • Non-Relational Databases – MongoDB, CouchDB
  • File Systems – Basic Windows File system
  • Here is a useful description of some of the top open source DB solutions

Relational databases can be great if you know exactly what you want to capture, while non-relational databases provide more flexibility for collecting information that has less structure. Finally, file systems (like the one on your PC where you save Word docs and family photos) can also be used, but because they will capture anything, and they are not easy to extract information from, these may not be the best option. No matter which solution you choose, try to find one that allows you to automatically collect the information from your product or service. This ensures greater consistency in the data and reduces the potential problem of building biased insights for future analytics. In other words, promising me that you will remember to enter all those survey responses from customers and save them in a file somewhere probably isn't a good data capture strategy (a minimal sketch of automated capture follows below). Once you have a good, or even decent, mechanism for capturing and saving data, the remaining steps can get a bit complex and may require a more traditional data science consultant to build the insights you are interested in leveraging. It is important to note that at this point I am grossly oversimplifying the data science process, but by the time you get to this point, hopefully you have generated enough revenue and identified enough high-value use cases to justify hiring some additional help. For those of you who are interested in more technical details around setting up a more robust data science pipeline, I highly recommend this blog series that teaches how to leverage cloud resources to execute an end-to-end data science pipeline.
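
To make the idea of automated capture concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and field names are purely illustrative; the point is that your product writes feedback to the database at the moment it receives it, rather than relying on someone to enter it later.

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("feedback.db")  # creates the database file if it does not exist
conn.execute("""
    CREATE TABLE IF NOT EXISTS feedback (
        user_id TEXT,
        created_at TEXT,
        channel TEXT,
        message TEXT
    )
""")

def record_feedback(user_id, channel, message):
    # Call this wherever your product receives feedback (survey, email, in-app form)
    conn.execute(
        "INSERT INTO feedback (user_id, created_at, channel, message) VALUES (?, ?, ?, ?)",
        (user_id, datetime.now(timezone.utc).isoformat(), channel, message),
    )
    conn.commit()

record_feedback("user-123", "in-app survey", "I wish I could filter results by date.")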

Recognize the hurdles you must overcome before executing on data science in your products:

  • Hurdle 1: Do not over- or underestimate the business value of data science; in your early stages, overestimating is the bigger risk
  • Hurdle 2: Be careful not to jump in without a defined plan and process
  • Hurdle 3: Keep in mind that collecting data means keeping information on people, so security and privacy will be important issues to address
  • Hurdle 4: When a high-value use case is identified, clearly define success metrics
  • Hurdle 5: Building data science requires some level of experience with data engineering, statistics, and scripting. Thus, it is essential to find a trusted partner to help enable your budding data science practice.

Thanks for reading and please feel free to reach out to let us know what you liked, didn’t like, or would like to see more of. We are particularly interested in any future content you would like us to examine so don’t be shy. Email (info@betacosine.com) or comment below.

Data Science: from PDF to Searchable DB

Have you ever found yourself wanting to be able to process text data from a pdf file and then make that information searchable across files, not just within one document? Common problem, right? Well, maybe not that common for the individual, but one example would be creating a searchable archive of reports that have been saved as pdf documents so that you can more easily find reports that match certain interests. Or perhaps you want to make emails from legacy systems searchable.

In this post we explore an application of computer vision to extract text from a pdf and then pump the text results into a searchable database so that the information is quickly accessible to an end user application. Think Google, but for documents not available online. The benefits of doing this are that you can make a large number of documents searchable very quickly (e.g. good for impatient app users) and enable more complex search functionality over the text (e.g. good for picky app users).

In order to solve this problem there are a number of steps we must take, so bear with me across each section. I have tried to split the post into stand-alone sections that each explain how to achieve part of this larger problem. We will use Python and ImageMagick to pre-process the pdf or image for text extraction, Tesseract to perform the computer vision piece of extracting text from an image, and sqlite as our database solution for creating a searchable repository of the extracted text.

Let’s name our environment before we begin:

  • Windows 10 OS
  • Tesseract 4.0
  • ImageMagick 6.9.10-Q8
  • Sqlite 3.14.2
  • Python 3.6 libraries:
    • Pytesseract 0.2.5
    • PIL 4.2.1
    • Wand 0.4.4
    • Sqlite3 2.6.0 (comes with Python 3.X as standard library)

Some notes on installing Tesseract and ImageMagick

Both Tesseract and ImageMagick are separate software tools that need to be installed on your environment in order to enable pdf-to-image processing, pre-OCR processing, and OCR (text extraction).

The focus here is on a Windows environment rather than Linux but much of the code we produce below will be the same. Open source tools like Tesseract and ImageMagick tend to be easier to load into Linux environments but since I have had to work in both I wanted to perform this in a Windows environment to show that it is possible. A few notes regarding installation in a Windows environment:

ImageMagick

You will want to use ImageMagick 6.9.10-Q8 instead of their latest version. This is because version 6 is their most stable with regard to the Python functions that we will use to leverage this piece of software. You can find the proper dll file here:
http://ftp.icm.edu.pl/packages/ImageMagick/binaries/?C=N;O=D
There you will find both 32 and 64 bit executable files. Because I am running on a 64 bit Windows 10 machine I downloaded the following file:

ImageMagick-6.9.10-11-Q8-x64-dll.exe

Once downloaded you can double click and follow the installer instructions.
Once ImageMagick is installed on your computer you will want to add a new environment variable called MAGICK_HOME set to the path where your instance was installed. For example, mine installed on the C drive in Program Files, so my variable looks like this:

C:\Program Files\ImageMagick-6.9.10-Q8

Tesseract

To get the installers for your specific Windows environment visit here and download the appropriate executable:
https://github.com/UB-Mannheim/tesseract/wiki
Once downloaded and installed on your machine you will want to add Tesseract to your PATH variables in your environment variables. When you access your environment variables, open your PATH variable and add the location of Tesseract to the list. My path looks like this:

C:\Program Files (x86)\Tesseract-OCR
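
If Python still cannot find the Tesseract executable after updating PATH (a common hiccup on Windows), you can also point pytesseract at it explicitly. A quick sketch; the path below is just an example and should match wherever Tesseract landed on your machine:

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'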

Now that we have all of the non-python software installed we are ready to get started!

Sqlite

Sqlite is a very lightweight database engine that has text indexing capabilities. Because it is among the most widely deployed database engines in the world and it comes with Python, we will use it here.

Let’s Get Coding!

First step is to make sure you have all necessary Python libraries installed and brought into memory. If you are using an interactive Python console like IPython, you can install them from within the console by prefixing pip with an exclamation point (a shell escape), like this:

!pip install pytesseract==0.2.5
!pip install pillow==4.2.1
!pip install wand==0.4.4

Step 1: Convert a PDF to an Image

Once installed, we will start with the conversion of pdf to image since Tesseract cannot consume pdfs directly. In order to create a bunch of pdf files quickly, I used an extension for Chrome called “Save-emails-to-pdf”. It is fast and allows you to save a lot of emails in Gmail to pdf files by simply checking the emails in the checkbox and clicking on the download button:

We will use these pdf files to convert to images, and then perform OCR. If you are thinking ‘hey, why not just use the pdf library in Python to extract the text directly,’ you would be correct in that creating pdf files like this does make the text extractable directly from the pdf code. The problem is that many pdf files do not have text embedded but rather represent images of text. In these instances, the only way to extract the text is to perform OCR from an image file. But I digress, onward…
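
As an aside, if your PDFs do have an embedded text layer, a quick sanity check with a PDF text library can save you the OCR step entirely. A sketch of that check follows; it assumes the PyPDF2 package, which is not part of the environment listed above, and a hypothetical example.pdf in our working directory:

from PyPDF2 import PdfReader  # pip install PyPDF2

reader = PdfReader('C:/betacosine/OCR/example.pdf')
embedded_text = "\n".join(page.extract_text() or "" for page in reader.pages)
if embedded_text.strip():
    print("Text layer found; OCR may not be needed for this file.")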

from wand.image import Image
import wand.image
import wand.api
import numpy as np
import io
import PIL.Image
import pytesseract
from os import listdir

path = 'C:/betacosine/OCR/'
pdf_list = [x for x in listdir(path) if x.endswith('.pdf')]

For this example, I have placed the example pdf emails in the path described in the code above. We then use the listdir method to get a list of all the pdf files in the directory that was specified.

import ctypes

# Expose ImageMagick's MagickEvaluateImage through wand's C API so we can apply thresholding
MagickEvaluateImage = wand.api.library.MagickEvaluateImage
MagickEvaluateImage.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_double]

def evaluate(self, operation, argument):
    # Apply an evaluate operation (e.g. 'threshold') scaled by the image's quantum range
    MagickEvaluateImage(
        self.wand,
        wand.image.EVALUATE_OPS.index(operation),
        self.quantum_range * float(argument))

Before we process the pdf files to images we need to set up our ImageMagick methods and functions that will be used to convert pdf files to images for OCR. I have found that Tesseract performs best on digital text extraction when we convert the pdfs to grayscale and use some level of thresholding to pre-process the image before passing through the OCR engine.

Thresholding is a simple and efficient way of separating the foreground from the background in an image. To complete this using ImageMagick we need to specify some additional pieces of information and so we develop a function that will perform thresholding for us.

There is another software tool called OpenCV that has an easier-to-use Python interface for thresholding and other image pre-processing but to keep things a little simpler here, I just focus on ImageMagick. See this great tutorial on using Tesseract with OpenCV.

text_list = []

for i in pdf_list:
    text_list2 = []
    with Image(filename=path+i) as img1:
        num_pgs = len(img1.sequence)  # number of pages in this pdf
        for p in range(num_pgs):
            with Image(filename=path+i+"["+str(p)+"]", resolution=200) as img:
                img.type = 'grayscale'
                evaluate(img, 'threshold', .60)
                img_buffer = np.asarray(bytearray(img.make_blob(format='png')), dtype='uint8')
                bytesio = io.BytesIO(img_buffer)
                text = pytesseract.image_to_string(PIL.Image.open(bytesio))
                text_list2.append(text)
    text_list.append(text_list2)

So, the code above is not the prettiest, I definitely get it. And this is where I show my cards as more of a data scientist than a programmer but the code above does work and it is efficient. Because it is complex, let me walk you through what is happening here. In the first line we are creating an open list container called text_list, which will be where we put our OCR’d text results. At the beginning of the “for loop,” we start by iterating over each of the pdf files in our directory. Because most of the pdf files are multiple pages and we want to OCR each page, we need to iterate our tasks over each page. Thus, we use the first “with” statement to get the total page numbers in each pdf file.

The second “for loop” iterates over each page number and performs the functions contained in the second “with” statement on each page. In that second “with” statement we start by converting the image to grayscale, then perform the thresholding. In the next line that starts with “img_buffer” we are creating a Numpy array out of the binary that we get when we use the “make_blob” method from ImageMagick. We then convert it to a bytes object so that we can open it using the PIL library. All of this is done so that we do not need to spend precious compute resources writing the image to disc and then reading it back into memory to perform OCR. This way we can just pass the object directly along to Tesseract for OCR.

Finally, we append the resulting text to our text_list2, which then gets appended to our text_list. What we are left with is a list of lists:

You will notice that I am only processing 6 emails here.

flat_list = ['\n'.join(map(str, x)) for x in text_list]

In the line of code above, we flatten the result by joining each email's pages into a single string, leaving us with one list of strings rather than a list of lists.

At this point you could add some text processing steps that further improve the readability and accuracy of the text that has been extracted. See some example cleaning functions in one of our previous blog posts here.

Step 2: Creating a Searchable Database

Now that we have our text data, we want to enter it into a database that indexes the text field we want to be able to search on.

import sqlite3
sqlite_db = 'email_db.db'
zip_list = list(zip(pdf_list,flat_list))
conn = sqlite3.connect(sqlite_db)
c = conn.cursor()

In the first two lines of code above, we import the sqlite3 library and provide a name for our database (email_db, original, I know). We then zip our list of pdf file names together with our list of text results for each email into a list of tuples, which makes for fast insertion into the database.

In the last two lines of code we create a connection to the database. If the database does not exist, this will create it. Then we activate the cursor to be able to interact with the database.

c.execute('CREATE VIRTUAL TABLE email_table USING fts4(email_title TEXT, email_text TEXT)')
c.executemany('INSERT INTO email_table(email_title,email_text) VALUES(?,?)', zip_list)
conn.commit()

Next, we use Sqlite's text indexing functionality by creating a table using the FTS4 extension. In Sqlite, the FTS3, FTS4, and FTS5 extensions provide text indexing capabilities that significantly reduce the time it takes to get results back from a query and add additional text searching capabilities. Read more about it here.

Finally, we insert our data from zip_list into the new table and commit it.

Done! You have now created a searchable database of text data that will respond in milliseconds to even the most complex text searches, over millions of rows. I have done this for over 12 million rows of data and get search results in .1 to .2 seconds.

In addition, you can now leverage more complex text querying features available in the FTS extensions for SQLite. For example, you can now search for words that are within a certain number of other words and return results. You can even return results that include additional characters highlighting where in the text your search term appears. I have included some example code below.

import pandas as pd
df = pd.read_sql('''select * from email_table where email_text MATCH 'trump NEAR/5 tweet*';''',conn)
conn.close()

In the above code, I use pandas to pull out a dataframe of results where I search for any rows that say trump and tweet within 5 words of each other. Pretty cool huh!?! Now slap a fancy UI on top and, bingo, a searchable interface with an indexed DB on the backend.
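
For the highlighting behavior mentioned above, FTS4 also exposes a snippet() helper that wraps matched terms in markers of your choosing and returns the surrounding text. A quick sketch (the bracket markers are arbitrary, and this needs to run before conn.close()):

df_snip = pd.read_sql(
    '''SELECT email_title, snippet(email_table, '[', ']') AS hit
       FROM email_table WHERE email_text MATCH 'tweet*';''', conn)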

As always, we look forward to any comments, ideas, or feedback this may have inspired in you. Stay tuned for more ideas that enable Ai!

Is Data Science Dead?

We have all heard it, or read about it, or both. The data scientist is dying and there is little we can do to hold on to our cushy salaries, rock-star-like images, and inflated egos. Obviously, I am overstating things here for dramatic effect, but the message is still anxiety-provoking for many data science professionals who have begun to smell blood in their industrial waters as concepts like “citizen data scientist,” “democratization of analytics,” and “automated machine learning” are thrown around by more and more executive teams. Such fears were stoked earlier this year when Matt Tucker's article “The Death of the Data Scientist” was published on Data Science Central and Justin Dickerson's appeared on LinkedIn, though neither was the first to make such a claim. But are data scientists, as we know them today, truly a breed bound for extinction? In the remainder of this post, I explore this idea while offering an alternative perspective on what the future may look like for the current data science professional.

The Rise of the Machines

At its most fundamental level, the argument goes something like this: many of the activities of the data scientist are quantifiable or statistical in nature and are, as a result, automatable. Therefore, the better we can orchestrate statistical models together in an automated fashion, the less need there is for a data scientist to be pulling on the levers that select, optimize, and deploy data-driven insights. Indeed, companies and products such as DataRobot, Google's AutoML, and the ever-expanding access to pre-trained, service-based data science models (Azure Cognitive Services, Google Ai Services, AWS, Watson) have made significant strides toward achieving just that: an artificial data scientist.

From Rise to Ride

Despite this dire prognosis for a field that was always poorly defined anyway, those who claim the title of data scientist have nonetheless developed ample skill to evolve with the coming wave of artificial data science. Thus, we must replace our conjured-up images of the Skynets of the world rising to overthrow the last remaining strongholds of human data scientists with images of explorers riding the hype-wave of artificial intelligence technologies that are fundamentally embedded with skills only the human data scientist truly understands. To achieve such evolution, there are three areas the practicing data scientist must focus on and that the employer of the future must inspire: an ever evolving/expanding toolkit, the importance of the user experience, and the evangelism of a trade.

The Ever Evolving/Expanding Toolkit

If there is one thing that data scientists are good at, it's catching a buzz (and a few new tools along the way). The concept of data science itself is a buzz term that many professionals in business with any statistical understanding attached to themselves in order to improve their marketability, and to good effect. Why should we expect the building wave of artificial intelligence to be any different? As the concept of data scientist has evolved, so too have the tools associated with it, and thus the professionals in this field have been caught in a constant race to remain relevant by exposing themselves to the newest tools being made available. Although the rate of change has been nearly overwhelming, those who have survived and been able to demonstrate competence around the core functionality of these data science technologies are well poised to take advantage of the tools of artificial intelligence. Thus, the data scientists who learn to evolve will learn how to rebrand themselves as practitioners of artificial intelligence. But to be able to convince others of this rebrand, such professionals will need to continue to expand their toolkits. Whereas the early 2000s brought us Hadoop, NoSQL, IoT, Python's scikit-learn, Tensorflow, and Spark, the next generation will be leveraging Ai-as-a-Service, cloud computing, intelligent automation, and containerization for analytics. This means that data scientists must continue to learn how to leverage API calls, architect cloud environments that support data science, and deploy analytics to expose API endpoints.
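
To make that last point a bit more tangible, here is a minimal sketch of what "exposing analytics as an API endpoint" can look like in Python. Flask and the model.pkl artifact are stand-ins for whatever web framework and trained model you actually use:

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.pkl')  # hypothetical pre-trained scikit-learn model

@app.route('/predict', methods=['POST'])
def predict():
    # Expects JSON like {"features": [0.2, 1.7, 3.1]}
    features = request.get_json()['features']
    proba = model.predict_proba([features])[0, 1]
    return jsonify({'probability': float(proba)})

if __name__ == '__main__':
    app.run()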

The Importance of the User Experience

As you can see from above, statistical tools are not the only tools that will help data scientists survive in this quickly changing landscape. Artificial intelligence is not merely a set of statistical technologies; rather, it is the embedding of those statistical technologies into user experiences. Thus, the savvy data science survivalist will identify opportunities to solve problems using embedded statistical analytics. Such efforts will require a greater understanding of software programming concepts, which the data scientist is already well-poised for through the acquisition of open source scripting tools, and the ability to work more closely with application development teams. There are many ways to tackle the user experience problem, from both a technical and a theoretical perspective (see our previous blog post as one example), and what works will always depend on satisfying the user, but the key is to identify strategies whereby statistical models improve the user experience. In this way data scientists will need to continue to evolve their approach to problem solving. Where once we focused on using cutting-edge modeling techniques to extract insights from data, we now need to focus on their utility within an application.

Evangelizing a Trade

And finally, because the true test of our data science products depends on the user's ability to get value from them, we must be prepared to take our specialized understanding of these Ai-enabling technologies and empower the citizen data scientist rather than pontificate over the sacredness of our special anointed knowledge. Despite the apparent ease-of-use promised by the onslaught of automated data science products, citizen data scientists will still lack understanding of their application. As one Reddit user so elegantly put it, “most people can barely use Excel, and even most data/business analysts have a hard time understanding anything beyond basic aggregation and statistics”. Thus, businesses will look to data scientists to train the citizen data scientist of the future to use those tools as use cases permit. The reason that data scientists will be required is because data science is not a tool but rather a way of thinking and tackling problems. Tools certainly enable new ways of thinking, but people need to be trained on how to think about the tool in order for the tool to change their approach to solving problems. In short, we must evangelize the tools that enable the artificial data scientist. In this vein, data scientists become the hub of both artificial and human data science products within an organization, and the citizen data scientists the spokes.

From Data Scientist to Ai Practitioner

In conclusion, the data scientist is not dead, or dying for that matter, but is, instead, in need of a coming evolution. Those who are most successful in continuing to expand their tool kits to leverage Ai services, expose results to and interact with applications, and impart their way of thinking to enable others will be the most confidently poised to meet the coming needs of the Ai practitioner for the future of digital enterprise.

Text Analytics: Topic Analysis with Python

Ever notice how the world is turning more semantic? Instead of directing a mouse or even a finger to instruct a piece of technology to do something, we are now able to simply instruct with our voices. Devices such as Alexa or Google Home are becoming increasingly integrated with other technologies creating new opportunities for semantics to dictate how those technologies operate. But how does a machine, like a computer or an Alexa device know what I am saying? The short answer is through text analytics.

Even the business world is becoming increasingly aware of the power of information “hidden” in the hundreds, thousands, millions, or even billions of text-based data points. In fact, one of the most common requests I get from business users goes something like this:

User: I know that staff members take note of those interactions with customers, but I don’t have any visibility over what they are putting in those notes. I want to know what they do. Maybe what they are taking note of has important implications for my business process.

In this blog post we will explore some fundamental processes involved in text analytics that will help us address business requests like the one described above. The goal of this post will be to:

• Develop code to pre-process text data
• Perform topic analysis by groups
• Organize our results into a meaningful presentation
• Full code found at the bottom

Just a few additional notes on the code below. It is often the case that we want to discover topics that are hidden in text in more than one context. For example, maybe you want to know what males are talking about when compared to what females are talking about, or any other grouping for that matter. Thus, the example below explores topic analysis of text data by groups. This also differentiates this blog from other, excellent blogs, on the more general topic of text topic analysis.

Before starting, it is important to note just a few things regarding the environment we are working and coding in:

• Python 3.6 Running on a Linux machine
• NLTK 3.2.5
• Pandas 0.22.0
• Numpy 1.14.0
• Sklearn 0.19.1

Preprocessing Text Data

To begin, we must start with some text data. Because we are working from a business use case, we will assume that the data are contained in some type of relational table (i.e. RDBMS) or csv format. Either way, the first step will be to get the text data loaded into memory using Python's ever-powerful Pandas library. For this example I will read from a csv, but there is a lot of really good documentation for querying from RDBMS systems too (see pandas.read_sql()).

import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import wordnet
from nltk import pos_tag, WordNetLemmatizer

path = '/path/to/csv'
file = 'csvfiletoload.csv'

df = pd.read_csv(path+file, encoding='latin-1')

In the code above, we first import necessary libraries. Then, in the second chunk, we declare a path variable that allows us to use the same path for reading and writing input/output. We also declare a file variable that will be the name of the csv file we want to bring into memory.

Finally, we define a Data Frame (df) that reads our csv file into memory as a Pandas Data Frame. But why the “encoding = ‘latin-1’” in the code? When dealing with text data different programs will save the data with different underlying encodings to preserve symbols that may not be available in all encoding schemes. This is particularly important for text data. Pandas assumes a limited set of encoding schemes when loading data from csv or text files and sometimes with text data, we get encoding errors. For a list of different encodings that can be specified with text file loads in Pandas see the encodings for Python here.

pd.options.mode.chained_assignment = None

I have added the code above as a way of dealing with writing over data frames as we clean our data. Anytime we want to process a data frame with some cleaning algorithm, we may not care about the old data and so will want to simply write-over the old data frame. Although it is not always a best practice to write over old data with new data, it is often more efficient for memory and so I simply suggest using your own discretion. Using the above code will stop Pandas from printing a warning to this effect.

text_col = 'Your text column'
group_col = 'Group Column'
df_text = df[[group_col, text_col]]

df_text[text_col] = df_text[text_col].replace(to_replace=r'[,|?$.!\-:]', value=r'', regex=True)
df_text[text_col] = df_text[text_col].replace(to_replace=r'[^a-zA-Z ]', value=r' ', regex=True)
df_text[text_col] = df_text[text_col].replace(to_replace=r'\s\s+', value=r' ', regex=True)

In the next set of code we first identify the column that contains our grouper like gender. We then identify our column that contains our text data. Once we have identified our two most important columns we create a new data frame of just those columns called df_text.

Finally, I have included 3 different regex patterns for doing some initial processing of the text data. The first pattern removes common punctuation symbols from the text. You can add more symbols that may be unique to your data set by placing them inside the square brackets. The second pattern replaces all non-letters with a space. The last pattern collapses runs of whitespace into a single space to ensure that each word is separated from the next by only one space. These obviously have overlapping effects, so use one, all, or modify to your specific needs.

wnl = WordNetLemmatizer()
stop = set(nltk.corpus.stopwords.words('english'))
operators = set(['not', 'n/a', 'na'])
stopwords = stop - operators

def remove_stopwords(tokens, stopwords):
    return [token for token in tokens if token not in stopwords]

def get_wordnet_pos(treebank_tag):
    # Map Penn Treebank POS tags to the tag set WordNet's lemmatizer expects
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('N'):
        return wordnet.NOUN
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return 'n'

def lemmarati(tup_list):
    # tup_list is a list of (token, treebank_tag) tuples produced by pos_tag
    if not (np.all(pd.notnull(tup_list))):
        return tup_list
    outputlist = []
    for token, tag in tup_list:
        pos = get_wordnet_pos(tag)
        lemma = wnl.lemmatize(token, pos)
        outputlist.append(lemma)
    return outputlist

In the next set of code, we are activating and setting up some functions that will allow us to do some more cleaning and normalizing of the text data. More specifically, the code sets up a function to remove stopwords, or words that are very common and as a result not all that meaningful (e.g. the). The remaining code also performs lemmatization. Lemmatization is a way of normalizing text so that words like Python, Pythons, and Pythonic all become just Python. Thus, lemmatization is like stemming but it takes the part of speech into account so that meet (v) and meeting (n) are kept separate.

Also, note that before defining our stopword list we remove some words that we want to keep in our topic analysis. Words like ‘not’ although often considered a stopword, can be very important when performing topic or sentiment analysis. Consider the difference between ‘happy’ and ‘not happy.’ The latter is the opposite of the former however if we used the nltk stopwords list we would remove ‘not’ from the list and run the risk of thinking most comments were ‘happy’ when in reality they were ‘not happy.’

I wanted to explain the code before invoking the functions, which we do below.

df_text[text_col] = df_text[text_col].map(lambda x: nltk.word_tokenize(x.lower()) if (np.all(pd.notnull(x))) else x.lower())

df_text[text_col] = df_text[text_col].map(lambda x: pos_tag(x) if (np.all(pd.notnull(x))) else x)

df_text[text_col] = df_text[text_col].map(lemmarati)

df_text[text_col] = df_text[text_col].map(lambda x: remove_stopwords(x,stopwords) if (np.all(pd.notnull(x))) else x)

In the code above we invoke the functions we created in the previous code block. The first line tokenizes (identify individual words) our text strings creating lists of word tokens. The next two sets of code perform parts of speech tagging (pos_tag) and then return the lemma for each word (lemmarati [I must have been inspired by the illuminati when I wrote this function 😉]). The last set removes the stopwords.

df_text[text_col] = df_text[text_col].map(lambda x: ' '.join(x) if (np.all(pd.notnull(x))) else x)

I include just one line of code because it is the last thing we will do before we move on to the topic analysis. When using scikit-learn to perform topic analysis we need to make sure we are submitting a string and not a list of word tokens. Thus, the above code gets the text back into sentence form, if you will. And now we are ready for topic analysis.

Topic Analysis using NMF (or LDA)

In the next section we perform Non-Negative Matrix Factorization (NMF), which can be thought of as similar to factor analysis for my behavioral science audience. Essentially, you first create a term document frequency matrix and then look for those terms that tend to show up together in documents at a higher frequency than other terms thus creating topics. This is a very high-level way of explaining the set of rather complicated algorithms ‘under the covers’ of this analysis but it should give you a good sense of how to interpret the results.

Another common and quite popular algorithm for topic analysis is Latent Dirichlet Allocation (LDA). Although quite popular, I often find NMF to produce better results with smaller data sets. Thus, we focus on NMF below but note that scikit-learn has a great example comparing both methods here where you will notice that the syntax for NMF is nearly identical to that of LDA.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

n_features = 1000
n_topics= 10

tfidf_vec = TfidfVectorizer(max_df = .95, min_df = 2, max_features = n_features, ngram_range = (2,3))

In the above code we start by importing the necessary libraries for topic analysis. In the next chunk we declare variables to limit our results to 1000 features and 10 topics. These settings can be adjusted to your preference. In the final set of code we activate the TfidfVectorizer with some important parameters. First, you may be asking what is Tf-idf? This refers to Term Frequency–Inverse Document Frequency. Think of Tf-idf as a weight for each word that represents how important the word is for each document in the corpus of documents. It is a slightly fancier way of creating a term frequency matrix. From Wikipedia: “The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.”

Second, as you examine the parameters in the code I want to emphasize the 'ngram_range=(2,3)' part. This tells the vectorizer to build the tf-idf matrix from both two-word (bigram) and three-word (trigram) sequences. In other words, we are asking the matrix to capture phrases, which can be quite meaningful for topic analysis.

Now, this next set of code performs the NMF analysis by group and saves the results as a list of data frames containing the top 5 bi and tri-grams for each of the 10 topics for each group. It is probably the least pythonic but lends itself to easy reporting to end users.

groups = df_text[group_col].unique()
results = []

for i in groups:
    df_grp = df_text.loc[df_text[group_col] == i]
    if len(df_grp[text_col]) > 100:
        tf = tfidf_vec.fit_transform(df_grp[text_col])
        feature_names = tfidf_vec.get_feature_names()
        try:
            nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tf)
            df_topics = pd.DataFrame(nmf.components_)
            df_topics.columns = feature_names
            df_top = df_topics.apply(lambda x: pd.Series(x.sort_values(ascending=False).iloc[:5].index, index=['top1','top2','top3','top4','top5']), axis=1).reset_index()
            df_top['Group'] = i
            results.append(df_top)
        except:
            results.append(i + ' Did not produce topic results')

In the code above, we first get a list of the unique groups in our grouping column. We then create a container (in this case a list) to hold our resulting data frames from the NMF topic analysis.

In the for loop, we perform a separate NMF analysis for each unique group contained in the grouping column. We use the 'if len(df_grp[text_col]) > 100' logic to ensure we have enough rows of text for the analysis. We use the 'try:' statement to ensure that the analysis will still run in case one of the groups gives us an error. In the 'try:' block we perform the NMF, extract the components into a data frame, label the data frame with the feature names (the bi- and trigrams), select only the top 5 bi- and trigrams for each topic based on their numeric contribution to the topic, add a column to the data frame to keep track of which group the topics are for, and append the results to our results list.

Now we have a list of data frames, which are not useful as a list so one more step before we finish.

topic_results = pd.concat(results, axis=0)
topic_results.to_csv(path + 'my_NMF_results.csv')

Now we are done. The ‘my_NMF_results.csv’ file now contains a nicely organized table of 10 topics by group showing the top 5 bi and trigrams that can help you to understand the business meaning of the topic. Your results should look something like this:

Stay tuned for future blogs where we will use the results of our topic analysis to score new text by topic, perform sentiment analysis and topic classification, and run other analytics that will help us meet the challenges of dealing with text data.

Feel free to add comments or questions. Be kind and respectful as unkind or disrespectful posts will be removed.

Full Code Set

import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import wordnet
from nltk import pos_tag, WordNetLemmatizer

path = '/path/to/csv'
file = 'csvfiletoload.csv'

df = pd.read_csv(path+file, encoding='latin-1')
pd.options.mode.chained_assignment = None
text_col = 'Your text column'
group_col = 'Group Column'
df_text = df[[group_col, text_col]]

df_text[text_col] = df_text[text_col].replace(to_replace=r'[,|?$.!\-:]', value=r'', regex=True)
df_text[text_col] = df_text[text_col].replace(to_replace=r'[^a-zA-Z ]', value=r' ', regex=True)
df_text[text_col] = df_text[text_col].replace(to_replace=r'\s\s+', value=r' ', regex=True)
wnl = WordNetLemmatizer()
stop = set(nltk.corpus.stopwords.words('english'))
operators = set(['not', 'n/a', 'na'])
stopwords = stop - operators

def remove_stopwords(tokens, stopwords):
    return [token for token in tokens if token not in stopwords]

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('N'):
        return wordnet.NOUN
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return 'n'

def lemmarati(tup_list):
    if not (np.all(pd.notnull(tup_list))):
        return tup_list
    outputlist = []
    for token, tag in tup_list:
        pos = get_wordnet_pos(tag)
        lemma = wnl.lemmatize(token, pos)
        outputlist.append(lemma)
    return outputlist
df_text[text_col] = df_text[text_col].map(lambda x: nltk.word_tokenize(x.lower()) if (np.all(pd.notnull(x))) else x.lower())

df_text[text_col] = df_text[text_col].map(lambda x: pos_tag(x) if (np.all(pd.notnull(x))) else x)

df_text[text_col] = df_text[text_col].map(lemmarati)

df_text[text_col] = df_text[text_col].map(lambda x: remove_stopwords(x,stopwords) if (np.all(pd.notnull(x))) else x)
df_text[text_col] = df_text[text_col].map(lambda x: ' '.join(x) if (np.all(pd.notnull(x))) else x)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

n_features = 1000
n_topics= 10

tfidf_vec = TfidfVectorizer(max_df = .95, min_df = 2, max_features = n_features, ngram_range = (2,3))
groups = df_text[group_col].unique()
results = []

for i in groups:
    df_grp = df_text.loc[df_text[group_col] == i]
    if len(df_grp[text_col]) > 100:
        tf = tfidf_vec.fit_transform(df_grp[text_col])
        feature_names = tfidf_vec.get_feature_names()
        try:
            nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tf)
            df_topics = pd.DataFrame(nmf.components_)
            df_topics.columns = feature_names
            df_top = df_topics.apply(lambda x: pd.Series(x.sort_values(ascending=False).iloc[:5].index, index=['top1','top2','top3','top4','top5']), axis=1).reset_index()
            df_top['Group'] = i
            results.append(df_top)
        except:
            results.append(i + ' Did not produce topic results')

topic_results = pd.concat(results, axis=0)
topic_results.to_csv(path + 'my_NMF_results.csv')

Human-Centered Data Science

Data science, advanced analytics, machine learning, artificial intelligence, cognitive computing, and natural language processing are all buzz words popular in the business world today because so many use cases have demonstrated how leveraging these tools can lead to significant competitive advantages.

Despite the proven power of these tools, many still struggle with successful implementation, not because the tools are losing their power, but because many data science teams, vendors, and individuals fail to properly integrate the tools of data science within the context of human decision making. Thus, great data science products are built, but their true impact is lost on the often irrational, biased, and difficult-to-predict humans who are tasked with using them.

This problem is not new, and what we explore in this post is the idea that we may be able to learn a thing or two from the past in order to develop new roadmaps for successful data science. Herein we look at different models that bridge products with people.

At the intersection of data science and human psychology, lies a multidisciplinary field that is ripe for implementation.

What is the design of everyday data science?

In 1988, Donald Norman published the book The Psychology of Everyday Things, which was later revised and retitled The Design of Everyday Things. The ideas contained in these books were simple, powerful, and disruptive because, prior to this time, no one had formalized how to merge engineering with human psychology. These books helped inspire the field of user experience (UX) design, more formally known as Human-Centered Design (HCD).

Flash forward 30 years, and data science in many businesses may be failing in the same way that design failed before people started to actually incorporate the study of humans into design engineering. But the problem with data science goes beyond the design of everyday things because the products of data science are often not things. Rather they are insights, automations, and models of human skills and abilities. Thus, we must not only take ideas from HCD to improve the user experience with the products of data science but we must also leverage other disciplines to fully grasp a roadmap to successful data science implementation.

What should the design of everyday data science be?

Because the products of data science are increasingly integrated with things, be they refrigerators, toasters, cars, or applications, the design of everyday data science would indeed benefit from some of the principles of HCD that were the bedrock of Dr. Norman’s original ideas.

Before we get started, it is important to define a few key concepts (from Bruce Tognazzini’s extensive work on HCD):

  • Discoverability: “ensures that users can find out and understand what the system can do.”
  • Affordances: “A relationship between the properties of an object and the capabilities of the agent that determine just how the object could possibly be used.”
  • Signifiers: “Affordances determine what actions are possible. Signifiers communicate where the action should take place.”
  • Mappings: “Spatial correspondence between the layout of the controls and the devices being controlled.”
  • Feedback: Immediate reaction and appropriate amount of response.

To that end, we, as data scientists, must ensure discoverability in our products. We often fail here because we believe that insights derived from statistical models or advanced analytics are in and of themselves discoveries and so therefore are already discoverable. This assumption however is incorrect because insights are only as valuable as they are applicable to the business or user. Therefore, we must articulate what it means to deliver data science products that are more discoverable. This includes all the elements of discoverability including identifying affordances, signifiers, mappings, and opportunities for feedback.

A data science product is delivered in the context of interacting humans and is thus only as good as it allows users to discover how its affordances improve their experiences. An affordance is not an attribute of the product but rather a relationship between the user and the product (Norman, 1988). If a data science classification model replaces the need for someone to click through thousands of documents to find information then its affordances are time, improved quality, and augmented performance. These should be clearly discoverable through the way the product is delivered through documentation and signifiers.

Signifiers signal to users possible points of use that create affordances. In data science this can mean delivering key drivers with models so that users have clarity on why, in the case of the above example, different documents are being categorized, tagged, or labeled by the model. Doing so lends itself to the discovery of affordances such as improved quality and performance augmentation.

Mappings, to Dr. Norman, referred to how different design elements map to their functions. For example, light switches map to light bulbs by enabling them to turn on or off. In data science we often map the outputs of models to probabilities or to decisions expressed as 1's and 0's, but for users this is not intuitive, and so the mapping is not all that useful. Thus, we can adjust our mappings to include qualifiers that represent a more intuitive application of our data science products. For example, probabilities become buckets of “High Risk,” “Moderate Risk,” and “Low Risk” value labels that improve the ability of users to map the outputs of our models to their functions.
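
As a small illustration of this kind of mapping, here is a sketch that converts raw model probabilities into the risk labels described above. The cut points are arbitrary and would need to be set with your users and use case in mind:

import pandas as pd

scores = pd.Series([0.05, 0.42, 0.87, 0.63])  # hypothetical model probabilities
risk_labels = pd.cut(scores,
                     bins=[0, 0.33, 0.66, 1.0],
                     labels=['Low Risk', 'Moderate Risk', 'High Risk'])
print(risk_labels)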

In many ways optimal mappings will not be apparent until we have had the chance to obtain feedback from users. For business users feedback can be explicit and carried out in ways that follow the principles of good design (simple, easy, and unobtrusive). In the rare case where our users are actually customers of our insights (a model that predicts someone’s likelihood to get a job or their success in a relationship) then feedback must also be intuitive and responsive (see also below where we expand on responsive feedback design through voice).

More psychology!!!

But it is not enough to simply borrow concepts from HCD to improve the success of data science products. Because these products are deployed to interact with people, both customers and business users alike, our success pipeline must be sensitive to the political and social psychological relationships that define how these individuals interact with each other and our products.

For example, machines that deliver automation or even augmentation to a business user can feel threatening. The threats can be in the form of threats to job security or they can threaten one’s feelings of efficacy and expertise. Thus, our data science pipeline must be sensitive to this outcome by directly addressing feelings of threat in order to achieve buy-in. Social psychologists have long recognized that to increase buy-in, people need to feel as though a new process is fair, and to ensure fairness the change process requires voice. Voice is the opportunity granted to users to partake in how the process actually unfolds. From a data science perspective this means that we enable opportunities, not just for feedback as we learned from HCD but to demonstrate how that feedback actually created change in our product.

For example, explain the key model features to users and solicit feedback for different ways to group those features into meaningful and actionable groupings. In one such instance, a client had the idea to group features that could be affected via different outreach mediums (e.g. personal phone call, email nudge, etc.). By incorporating this feedback into the product, users were already thinking about how to creatively develop content that could address these differences when they saw those with high probabilities (e.g. risk scores) along with key drivers that better matched different modes of outreach. Users saw the affordances because they were now an active participant in using the product to improve their own impact.

But voice is not the only perspective in psychology that can help to develop a successful data science product pipeline. Indeed, one could incorporate concepts from political psychology or motivation to understand the relational aspects of a product's success. We leave this to the imagination and creativity of you, the reader. Feel free to comment below on ideas to continue this conversation and push the envelope further in pursuing more effective models for data science success.

End-to-end success checklist

A useful checklist to consider in developing a successful data science pipeline might look something like this:

  • What characteristics make up the primary user groups for this product?
  • How do those characteristics suggest different possible affordances of my product? What does my product enable or prevent (anti-affordances) for those specific users?
  • What delivery or deployment method makes the most sense to achieve these affordances?
  • How do I signal these affordances to my user base?
  • What mappings make the most sense from my users’ perspective?
  • Am I providing opportunities for feedback that are simple, easy, and unobtrusive?
  • Can I demonstrate how the feedback has changed the product?

This concludes our post on successful data science product pipelines. We appreciate you taking the time to read this and look forward to seeing your continued ideas in the comments below. Although this post was high-level and rather theoretical, stay tuned as we will be including future topics that explore more practical issues in coding for data science and human decision making.

I would also emphasize that this is merely one application that attempts to merge different fields but there are many other approaches. The key is to recognize the value of cross pollination from fields as diverse as data science, data engineering, app development, user-experience, and psychology. Cheers!

References

Norman, D. (2013). The design of everyday things: Revised and expanded edition. Constellation.

Tognazzini, B. (2014). First Principles of Interaction Design (Revised & Expanded).  AskTog.