Is Data Science Dead?

We have all heard it, or read about it, or both. The data scientist is dying, and there is little we can do to hold on to our cushy salaries, rock-star-like images, and inflated egos. Obviously, I am overstating things here for dramatic effect, but the message is still anxiety-provoking for many data science professionals, who have begun to smell blood in their industrial waters as concepts like “citizen data scientist,” “democratization of analytics,” and “automated machine learning” are thrown around by more and more executive teams. Such fears were stoked earlier this year when Matt Tucker’s article “The Death of the Data Scientist” was published on Data Science Central and Justin Dickerson’s appeared on LinkedIn, though neither was the first to make such a claim. But are data scientists as we know them today truly a breed bound for extinction? In the remainder of this post, I explore this idea while offering an alternative perspective on what the future may look like for the current data science professional.

The Rise of the Machines

At its most fundamental level, the argument goes something like this: many of the activities of the data scientist are quantifiable or statistical in nature and are, as a result, automatable. Therefore, the better we can orchestrate statistical models together in an automated fashion, the less need there is for a data scientist to be pulling the levers that select, optimize, and deploy data-driven insights. Indeed, companies and products such as DataRobot, Google’s AutoML, and the ever-expanding access to pre-trained, service-based data science models (Azure Cognitive Services, Google AI services, AWS, Watson) have made significant strides toward achieving just that: an artificial data scientist.

From Rise to Ride

Despite this dire prognosis for a field that was always poorly defined anyway, those who claim the title of data scientist have nonetheless developed ample skill to evolve with the coming wave of artificial data science. Thus, we must replace our conjured-up images of the Skynets of the world rising to overthrow the last remaining strongholds of human data scientists with images of explorers riding the hype wave of artificial intelligence technologies that are fundamentally embedded with skills only the human data scientist truly understands. To achieve such evolution, there are three areas the practicing data scientist must focus on and that the employer of the future must inspire: an ever-evolving and expanding toolkit, the importance of the user experience, and the evangelism of a trade.

The Ever Evolving/Expanding Toolkit

If there is one thing that data scientists are good at, it’s catching a buzz (and a few new tools along the way). The concept of data science is itself a buzz term that many business professionals with any statistical understanding attached to themselves in order to improve their marketability, and to good effect. Why should we expect the building wave of artificial intelligence to be any different? As the concept of the data scientist has evolved, so too have the tools associated with it, and thus professionals in this field have been caught in a constant race to remain relevant by exposing themselves to the newest tools being made available. Although the rate of change has been nearly overwhelming, those who have survived and been able to demonstrate competence with the core functionality of these data science technologies are well poised to take advantage of the tools of artificial intelligence. Thus, the data scientists who learn to evolve will learn how to rebrand themselves as practitioners of artificial intelligence. But to convince others of this rebrand, such professionals will need to continue expanding their toolkits. Whereas the past two decades brought us Hadoop, NoSQL, IoT, Python’s scikit-learn, TensorFlow, and Spark, the next generation will be leveraging AI-as-a-Service, cloud computing, intelligent automation, and containerization for analytics. This means that data scientists must continue to learn how to leverage API calls, architect cloud environments that support data science, and deploy analytics that expose API endpoints.

The Importance of the User Experience

As you can see from above, statistical tools are not the only tools that will help data scientists survive in this quickly changing landscape. Artificial intelligence is not merely a set of statistical technologies; rather, it is the embedding of those statistical technologies into user experiences. Thus, the savvy data science survivalist will identify opportunities to solve problems using embedded statistical analytics. Such efforts will require a greater understanding of software programming concepts, which the data scientist is already well poised for through the acquisition of open source scripting tools, and the ability to work more closely with application development teams. There are many ways to tackle the user experience problem from both a technical and a theoretical perspective (see our previous blog post as one example), and what works will always depend on satisfying the user, but the key is to identify strategies whereby statistical models improve the user experience. In this way, data scientists will need to continue to evolve their approach to problem solving. Where once we focused on using cutting-edge modeling techniques to extract insights from data, we now need to focus on their utility within an application.

Evangelizing a Trade

And finally, because the true test of our data science products depends on the user’s ability to get value from them, we must be prepared to take our specialized understanding of these AI-enabling technologies and empower the citizen data scientist rather than pontificate over the sacredness of our specially anointed knowledge. Despite the apparent ease of use promised by the onslaught of automated data science products, citizen data scientists will still lack an understanding of their application. As one Reddit user so elegantly put it, “most people can barely use Excel, and even most data/business analysts have a hard time understanding anything beyond basic aggregation and statistics.” Thus, businesses will look to data scientists to train the citizen data scientists of the future to use those tools as use cases permit. The reason data scientists will be required is that data science is not a tool but rather a way of thinking about and tackling problems. Tools certainly enable new ways of thinking, but people need to be trained on how to think about a tool in order for it to change their approach to solving problems. In short, we must evangelize the tools that enable the artificial data scientist. In this vein, data scientists become the hub of both artificial and human data science products within an organization, and citizen data scientists the spokes.

From Data Scientist to AI Practitioner


In conclusion, the data scientist is not dead, or even dying for that matter, but is instead in need of a coming evolution. Those who are most successful in continuing to expand their toolkits to leverage AI services, expose results to and interact with applications, and impart their way of thinking to enable others will be the most confidently poised to meet the coming needs of the AI practitioner in the future digital enterprise.

Text Analytics: Topic Analysis with Python

Ever notice how the world is turning more semantic? Instead of directing a mouse or even a finger to instruct a piece of technology to do something, we are now able to simply instruct with our voices. Devices such as Alexa or Google Home are becoming increasingly integrated with other technologies, creating new opportunities for semantics to dictate how those technologies operate. But how does a machine, like a computer or an Alexa device, know what I am saying? The short answer is through text analytics.

Even the business world is becoming increasingly aware of the power of information “hidden” in the hundreds, thousands, millions, or even billions of text-based data points. In fact, one of the most common requests I get from business users goes something like this:

User: I know that staff members take note of those interactions with customers, but I don’t have any visibility over what they are putting in those notes. I want to know what they do. Maybe what they are taking note of has important implications for my business process.

In this blog post we will explore some fundamental processes involved in text analytics that will help us address business requests like the one described above. The goals of this post are to:

• Develop code to pre-process text data
• Perform topic analysis by groups
• Organize our results into a meaningful presentation

(The full code set can be found at the bottom of this post.)

Just a few additional notes on the code below. It is often the case that we want to discover topics that are hidden in text in more than one context. For example, maybe you want to know what males are talking about compared to what females are talking about, or any other grouping for that matter. Thus, the example below explores topic analysis of text data by groups. This also differentiates this post from other excellent blogs on the more general topic of text topic analysis.

Before starting, it is important to note just a few things regarding the environment we are working and coding in:

• Python 3.6 running on a Linux machine
• NLTK 3.2.5
• Pandas 0.22.0
• NumPy 1.14.0
• scikit-learn 0.19.1

Preprocessing Text Data

To begin, we must start with some text data. Because we are working from a business use case, we will assume that the data are contained in some type of relational table (i.e., an RDBMS) or csv format. Either way, the first step will be to get the text data loaded into memory using Python’s ever-powerful Pandas library. For this example I will read from a csv, but there is a lot of really good documentation for querying from RDBMS systems too (see pandas.read_sql()).

import pandas as pd
import numpy as np
import nltk
from nltk.corpus import wordnet
from nltk import pos_tag, WordNetLemmatizer
# NLTK data used below (download once if needed): punkt, stopwords, wordnet,
# averaged_perceptron_tagger

path = '/path/to/csv/'
file = 'csvfiletoload.csv'

df = pd.read_csv(path + file, encoding='latin-1')

In the code above, we first import necessary libraries. Then, in the second chunk, we declare a path variable that allows us to use the same path for reading and writing input/output. We also declare a file variable that will be the name of the csv file we want to bring into memory.

Finally, we define a data frame (df) that reads our csv file into memory as a Pandas DataFrame. But why the “encoding='latin-1'” in the code? When dealing with text data, different programs will save the data with different underlying encodings to preserve symbols that may not be available in all encoding schemes. Pandas assumes UTF-8 by default when loading data from csv or text files, so with text data we sometimes get encoding errors. For a list of the different encodings that can be specified when loading text files with Pandas, see the standard encodings in the Python codecs documentation.
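If you are not sure which encoding a file was saved with, one defensive pattern is to try UTF-8 first and fall back to Latin-1. This is just a sketch using the placeholder path and file variables above:

# Sketch: try UTF-8 first, fall back to Latin-1 if the file was saved differently
try:
    df = pd.read_csv(path + file, encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv(path + file, encoding='latin-1')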

pd.options.mode.chained_assignment = None

I have added the code above as a way of dealing with writing over data frames as we clean our data. Any time we process a data frame with some cleaning algorithm, we may not care about the old data and will simply want to write over the old data frame. Although it is not always best practice to write over old data with new data, it is often more memory efficient, so I suggest using your own discretion. The line above stops Pandas from printing a warning to this effect.
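As an alternative to silencing the warning globally, you could make the slice an explicit, independent copy when we build df_text in the next block (a one-line sketch using the group_col and text_col variables defined just below):

# Explicit copy of the slice avoids the SettingWithCopyWarning altogether
df_text = df[[group_col, text_col]].copy()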

text_col = 'Your text column'
group_col = 'Group Column'
df_text = df[[group_col, text_col]]

df_text[text_col] = df_text[text_col].replace(to_replace=r'[,?$.!\-:]', value=r'', regex=True)
df_text[text_col] = df_text[text_col].replace(to_replace=r'[^a-zA-Z]', value=r' ', regex=True)
df_text[text_col] = df_text[text_col].replace(to_replace=r'\s\s+', value=r' ', regex=True)

In the next set of code we first identify the column that contains our grouping variable, such as gender, and the column that contains our text data. Once we have identified our two most important columns, we create a new data frame of just those columns called df_text.

Finally, I have included three regex-based cleaning steps. The first pattern removes common punctuation symbols by replacing them with nothing; you can handle additional symbols unique to your data set by adding them inside the square brackets. The second pattern replaces everything that is not a letter with a space. The last pattern collapses runs of whitespace into a single space so that each word is separated from the next by exactly one space. These steps have overlapping effects, so use one, all, or modify them to your specific needs.
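To make the effect of these patterns concrete, here is a quick illustration on a single made-up note (not real data):

# Quick illustration of the three cleaning patterns on a made-up note
sample = pd.Series(['Called customer -- very unhappy!! Acct #1234, follow-up???'])
sample = sample.replace(to_replace=r'[,?$.!\-:]', value=r'', regex=True)   # strip punctuation
sample = sample.replace(to_replace=r'[^a-zA-Z]', value=r' ', regex=True)   # non-letters -> space
sample = sample.replace(to_replace=r'\s\s+', value=r' ', regex=True)       # collapse whitespace
print(sample[0])  # 'Called customer very unhappy Acct followup'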

wnl = WordNetLemmatizer()
stop = set(nltk.corpus.stopwords.words('english'))
operators = set(['not', 'n/a', 'na'])
stopwords = stop - operators

def remove_stopwords(tokens, stopwords):
    return [token for token in tokens if token not in stopwords]

def get_wordnet_pos(treebank_tag):
    # Map Penn Treebank POS tags to the tag set WordNet's lemmatizer expects
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('N'):
        return wordnet.NOUN
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def lemmarati(tup_list):
    # Lemmatize a list of (word, POS tag) tuples produced by pos_tag()
    if not (np.all(pd.notnull(tup_list))):
        return tup_list
    outputlist = []
    for word, tag in tup_list:
        pos = get_wordnet_pos(tag)
        lemma = wnl.lemmatize(word, pos)
        outputlist.append(lemma)
    return outputlist

In the next set of code, we define some helper functions that will allow us to do some more cleaning and normalizing of the text data. More specifically, the code sets up a function to remove stopwords, or words that are so common they are not all that meaningful (e.g., “the”). The remaining code performs lemmatization. Lemmatization is a way of normalizing text so that words like Python and Pythons both become just Python. Lemmatization is like stemming, but it takes the part of speech into account so that meet (v) and meeting (n) are kept separate.

Also, note that before defining our stopword list we remove some words that we want to keep in our topic analysis. A word like ‘not’, although often considered a stopword, can be very important when performing topic or sentiment analysis. Consider the difference between ‘happy’ and ‘not happy.’ The latter is the opposite of the former, yet if we used the full NLTK stopwords list we would strip ‘not’ from the text and run the risk of concluding that most comments were ‘happy’ when in reality they were ‘not happy.’
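As a quick, illustrative sanity check of the lemmatizer and the stopword set defined above (the values in the comments are what NLTK returns for these toy inputs):

print(wnl.lemmatize('meetings', 'n'))   # -> 'meeting' (the noun form is kept)
print(wnl.lemmatize('meeting', 'v'))    # -> 'meet' (the verb form is reduced)
print('not' in stopwords)               # -> False, because we removed it above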

I wanted to explain the code before invoking the functions, which we do below.

df_text[text_col] = df_text[text_col].map(lambda x: nltk.word_tokenize(x.lower()) if (np.all(pd.notnull(x))) else x)

df_text[text_col] = df_text[text_col].map(lambda x: pos_tag(x) if (np.all(pd.notnull(x))) else x)

df_text[text_col] = df_text[text_col].map(lemmarati)

df_text[text_col] = df_text[text_col].map(lambda x: remove_stopwords(x, stopwords) if (np.all(pd.notnull(x))) else x)

In the code above we invoke the functions we created in the previous code block. The first line tokenizes our text strings (identifies individual words), creating lists of word tokens. The next two lines perform part-of-speech tagging (pos_tag) and then return the lemma for each word (lemmarati [I must have been inspired by the illuminati when I wrote this function 😉]). The last line removes the stopwords.

df_text[text_col] = df_text[text_col].map(lambda x: ' '.join(x) if (np.all(pd.notnull(x))) else x)

I include just one line of code because it is the last thing we will do before we move on to the topic analysis. When using scikit-learn to perform topic analysis we need to make sure we are submitting a string and not a list of word tokens. Thus, the above code gets the text back into sentence form, if you will. And now we are ready for topic analysis.

Topic Analysis using NMF (or LDA)

In the next section we perform Non-Negative Matrix Factorization (NMF), which can be thought of as similar to factor analysis for my behavioral science audience. Essentially, you first create a term-document frequency matrix and then look for those terms that tend to show up together in documents at a higher frequency than other terms, thus forming topics. This is a very high-level way of explaining the set of rather complicated algorithms ‘under the covers’ of this analysis, but it should give you a good sense of how to interpret the results.
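For intuition, NMF factors the documents-by-terms matrix X into two non-negative matrices, W (documents by topics) and H (topics by terms), so that X ≈ W × H. A minimal sketch on a random toy matrix, just to show the shapes involved (not our real data):

# Toy illustration of the factorization X ~ W x H
import numpy as np
from sklearn.decomposition import NMF

X = np.random.rand(6, 8)                   # pretend: 6 documents x 8 terms
toy_nmf = NMF(n_components=2, random_state=1)
W = toy_nmf.fit_transform(X)               # 6 documents x 2 topics
H = toy_nmf.components_                    # 2 topics x 8 terms
print(W.shape, H.shape)                    # (6, 2) (2, 8)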

Another common and quite popular algorithm for topic analysis is Latent Dirichlet Allocation (LDA). Although LDA is quite popular, I often find NMF to produce better results with smaller data sets. Thus, we focus on NMF below, but note that scikit-learn has a great example comparing both methods in its documentation, where you will notice that the syntax for NMF is nearly identical to that of LDA.
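For reference, here is a rough sketch of what the LDA equivalent would look like on a toy corpus; note that LDA is usually paired with raw term counts (CountVectorizer) rather than tf-idf weights:

# Sketch: the LDA syntax mirrors the NMF syntax almost exactly
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

toy_docs = ['customer called about billing issue',
            'billing issue resolved after call',
            'customer asked about new service',
            'new service request from customer']
counts = CountVectorizer().fit_transform(toy_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=1).fit(counts)
print(lda.components_.shape)   # topics x terms, same layout as nmf.components_ below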

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

n_features = 1000
n_topics= 10

tfidf_vec = TfidfVectorizer(max_df = .95, min_df = 2, max_features = n_features, ngram_range = (2,3))

In the above code we start by importing the necessary libraries for topic analysis. In the next chunk we declare variables to limit our results to 1000 features and 10 topics; these settings can be adjusted to your preference. In the final line we instantiate the TfidfVectorizer with some important parameters. First, you may be asking what tf-idf is. It stands for term frequency–inverse document frequency. Think of tf-idf as a weight for each word that represents how important the word is to each document in the corpus of documents; it is a slightly fancier way of creating a term frequency matrix. From Wikipedia: “The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.”
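To make the weighting concrete, here is a toy example (made-up sentences, reusing the pandas and TfidfVectorizer imports above) in which a word that appears in every document (‘customer’) ends up with a lower weight than rarer words in the same document:

# Toy tf-idf illustration with default unigram settings
toy_docs = ['customer happy with service',
            'customer unhappy with billing',
            'customer asked about billing']
toy_vec = TfidfVectorizer()
toy_weights = toy_vec.fit_transform(toy_docs)
# 'customer' appears in all three documents, so its tf-idf weight is discounted
# relative to words like 'happy' or 'service' that appear in only one document.
print(pd.DataFrame(toy_weights.toarray(), columns=toy_vec.get_feature_names()))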

Second, as you examine the parameters in the code, I want to emphasize the ngram_range=(2, 3) argument. This tells the vectorizer to build the tf-idf matrix from both two-word (bigram) and three-word (trigram) sequences. In other words, we are asking the matrix to capture phrases, which can be quite meaningful for topic analysis.
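To see exactly which bigrams and trigrams ngram_range=(2, 3) produces, you can run the vectorizer’s analyzer on a single cleaned string (illustrative only, using the TfidfVectorizer imported above):

# Illustration: the bigrams and trigrams the vectorizer extracts from one string
analyzer = TfidfVectorizer(ngram_range=(2, 3)).build_analyzer()
print(analyzer('customer not happy with billing'))
# ['customer not', 'not happy', 'happy with', 'with billing',
#  'customer not happy', 'not happy with', 'happy with billing']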

Now, this next set of code performs the NMF analysis by group and saves the results as a list of data frames containing the top 5 bi- and tri-grams for each of the 10 topics for each group. It is probably not the most Pythonic approach, but it lends itself to easy reporting to end users.

groups = df_text[group_col].unique()
results = []

for i in groups:
    df_grp = df_text.loc[df_text[group_col] == i]
    if len(df_grp[text_col]) > 100:
        tf = tfidf_vec.fit_transform(df_grp[text_col])
        feature_names = tfidf_vec.get_feature_names()
        try:
            nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tf)
            df_topics = pd.DataFrame(nmf.components_)
            df_topics.columns = feature_names
            df_top = df_topics.apply(lambda x: pd.Series(x.sort_values(ascending=False).iloc[:5].index, index=['top1', 'top2', 'top3', 'top4', 'top5']), axis=1).reset_index()
            df_top['Group'] = i
            results.append(df_top)
        except Exception:
            # Keep the loop going, but only print a message here because
            # pd.concat() below expects data frames, not strings.
            print(str(i) + ' did not produce topic results')

In the code above, we first get a list of the unique groups in our grouping column. We then create a container (in this case a list) to hold our resulting data frames from the NMF topic analysis.

In the for loop, we perform a separate NMF analysis for each unique group contained in the grouping column. We use the ‘if len(df_grp[text_col]) > 100’ condition to ensure we have enough rows of text for the analysis, and the ‘try:’ statement to ensure that the loop keeps running if one of the groups gives us an error. Inside the ‘try:’ block we fit the NMF model, extract the components into a data frame, label the data frame with the feature names (the bi- and tri-grams), select only the top 5 bi- and tri-grams for each topic based on their numeric contribution to that topic, add a column to keep track of which group the topics belong to, and append the result to our results list.

Now we have a list of data frames, which is not very useful as a list, so there is one more step before we finish.

topic_results = pd.concat(results, axis=0)
topic_results.to_csv(path + 'my_NMF_results.csv')

Now we are done. The ‘my_NMF_results.csv’ file contains a nicely organized table of 10 topics per group, showing the top 5 bi- and tri-grams that can help you understand the business meaning of each topic.

Stay tuned for future blogs where we will use the results of our topic analysis to score new text by topic, perform sentiment analysis and topic classification, and run other analytics that will help us meet the challenges of dealing with text data.

Feel free to add comments or questions. Be kind and respectful as unkind or disrespectful posts will be removed.

Full Code Set

import pandas as pd
import numpy as np
import nltk
from nltk.corpus import wordnet
from nltk import pos_tag, WordNetLemmatizer

path = '/path/to/csv/'
file = 'csvfiletoload.csv'

df = pd.read_csv(path + file, encoding='latin-1')
pd.options.mode.chained_assignment = None
text_col = 'Your text column'
group_col = 'Group Column'
df_text = df[[group_col, text_col]]

df_text[text_col] = df_text[text_col].replace(to_replace=r'[,?$.!\-:]', value=r'', regex=True)
df_text[text_col] = df_text[text_col].replace(to_replace=r'[^a-zA-Z]', value=r' ', regex=True)
df_text[text_col] = df_text[text_col].replace(to_replace=r'\s\s+', value=r' ', regex=True)

wnl = WordNetLemmatizer()
stop = set(nltk.corpus.stopwords.words('english'))
operators = set(['not', 'n/a', 'na'])
stopwords = stop - operators

def remove_stopwords(tokens, stopwords):
    return [token for token in tokens if token not in stopwords]

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('N'):
        return wordnet.NOUN
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def lemmarati(tup_list):
    if not (np.all(pd.notnull(tup_list))):
        return tup_list
    outputlist = []
    for word, tag in tup_list:
        pos = get_wordnet_pos(tag)
        lemma = wnl.lemmatize(word, pos)
        outputlist.append(lemma)
    return outputlist

df_text[text_col] = df_text[text_col].map(lambda x: nltk.word_tokenize(x.lower()) if (np.all(pd.notnull(x))) else x)
df_text[text_col] = df_text[text_col].map(lambda x: pos_tag(x) if (np.all(pd.notnull(x))) else x)
df_text[text_col] = df_text[text_col].map(lemmarati)
df_text[text_col] = df_text[text_col].map(lambda x: remove_stopwords(x, stopwords) if (np.all(pd.notnull(x))) else x)
df_text[text_col] = df_text[text_col].map(lambda x: ' '.join(x) if (np.all(pd.notnull(x))) else x)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

n_features = 1000
n_topics = 10

tfidf_vec = TfidfVectorizer(max_df=.95, min_df=2, max_features=n_features, ngram_range=(2, 3))

groups = df_text[group_col].unique()
results = []

for i in groups:
    df_grp = df_text.loc[df_text[group_col] == i]
    if len(df_grp[text_col]) > 100:
        tf = tfidf_vec.fit_transform(df_grp[text_col])
        feature_names = tfidf_vec.get_feature_names()
        try:
            nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tf)
            df_topics = pd.DataFrame(nmf.components_)
            df_topics.columns = feature_names
            df_top = df_topics.apply(lambda x: pd.Series(x.sort_values(ascending=False).iloc[:5].index, index=['top1', 'top2', 'top3', 'top4', 'top5']), axis=1).reset_index()
            df_top['Group'] = i
            results.append(df_top)
        except Exception:
            print(str(i) + ' did not produce topic results')

topic_results = pd.concat(results, axis=0)
topic_results.to_csv(path + 'my_NMF_results.csv')
Human-Centered Data Science

Data science, advanced analytics, machine learning, artificial intelligence, cognitive computing, and natural language processing are all buzz words popular in the business world today because so many use cases have demonstrated how leveraging these tools can lead to significant competitive advantages.

Despite the proven power of these tools, many still struggle with successful implementation, not because the tools are losing their power, but because many data science teams, vendors, and individuals fail to properly integrate the tools of data science within the context of human decision making. Thus, great data science products are built, but their true impact is lost through the often irrational, biased, and difficult-to-predict humans who are tasked with using them.

This problem is not new, and what we explore in this post is the idea that we may be able to learn a thing or two from the past in order to develop new roadmaps for successful data science. Herein we look at different models that bridge products with people.

At the intersection of data science and human psychology lies a multidisciplinary field that is ripe for implementation.

What is the design of everyday data science?

In 1988, Donald Norman published the book The Psychology of Everyday Things, which was later revised and republished as The Design of Everyday Things. The ideas contained in these books were simple, powerful, and disruptive because, prior to that time, no one had formalized how to merge engineering with human psychology. These books helped inspire the field of user experience (UX) design, more formally known as Human-Centered Design (HCD).

Flash forward 30 years, and data science in many businesses may be failing in the same way that design failed before people started to actually incorporate the study of humans into design engineering. But the problem with data science goes beyond the design of everyday things because the products of data science are often not things. Rather they are insights, automations, and models of human skills and abilities. Thus, we must not only take ideas from HCD to improve the user experience with the products of data science but we must also leverage other disciplines to fully grasp a roadmap to successful data science implementation.

What should the design of everyday data science be?

Because the products of data science are increasingly integrated with things, be they refrigerators, toasters, cars, or applications, the design of everyday data science would indeed benefit from some of the principles of HCD that were the bedrock of Dr. Norman’s original ideas.

Before we get started, it is important to define a few key concepts (from Bruce Tognazzini’s extensive work on HCD):

  • Discoverability: “ensures that users can find out and understand what the system can do.”
  • Affordances: “A relationship between the properties of an object and the capabilities of the agent that determine just how the object could possibly be used.”
  • Signifiers: “Affordances determine what actions are possible. Signifiers communicate where the action should take place.”
  • Mappings: “Spatial correspondence between the layout of the controls and the devices being controlled.”
  • Feedback: Immediate reaction and appropriate amount of response.

To that end, we, as data scientists, must ensure discoverability in our products. We often fail here because we believe that insights derived from statistical models or advanced analytics are in and of themselves discoveries and are therefore already discoverable. This assumption, however, is incorrect, because insights are only as valuable as they are applicable to the business or user. Therefore, we must articulate what it means to deliver data science products that are more discoverable. This includes all the elements of discoverability: identifying affordances, signifiers, mappings, and opportunities for feedback.

A data science product is delivered in the context of interacting humans and is thus only as good as it allows users to discover how its affordances improve their experiences. An affordance is not an attribute of the product but rather a relationship between the user and the product (Norman, 1988). If a data science classification model replaces the need for someone to click through thousands of documents to find information, then its affordances are time savings, improved quality, and augmented performance. These should be made clearly discoverable through the way the product is delivered, including its documentation and signifiers.

Signifiers signal to users possible points of use that create affordances. In data science this can mean delivering key drivers with models so that users have clarity on why, in the case of the above example, different documents are being categorized, tagged, or labeled by the model. Doing so lends itself to the discovery of affordances such as improved quality and performance augmentation.
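As one hypothetical illustration (not tied to any particular product), if the model behind the document-tagging example were a linear text classifier, the strongest terms per label could be surfaced as “key drivers” alongside each tag:

# Hypothetical sketch: surface the top terms ("key drivers") behind each label
# of a linear text classifier so users can see why a document was tagged.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ['invoice overdue payment reminder', 'payment received for invoice',
        'meeting agenda for project kickoff', 'project timeline and milestones']
labels = ['billing', 'billing', 'project', 'project']

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

terms = np.array(vec.get_feature_names())
top_terms = terms[np.argsort(clf.coef_[0])[-3:]]   # terms pushing hardest toward one label
print(top_terms)   # these become the signifiers shown alongside the model's tags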

For Dr. Norman, mappings referred to how different design elements map to their functions. For example, light switches map to light bulbs by turning them on or off. In data science, we often map the output of our models to raw probabilities or to decisions coded as 1’s and 0’s, but for users this mapping is not intuitive and therefore not all that useful. We can adjust our mappings to include qualifiers that represent a more intuitive application of our data science products. For example, probabilities become buckets of “High Risk,” “Moderate Risk,” and “Low Risk” labels that improve the users’ ability to map the outputs of our models to their functions.
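A minimal sketch of this kind of mapping, assuming a hypothetical column of model probabilities:

# Minimal sketch: map raw model probabilities (hypothetical values) to labels
# that users can act on directly.
import pandas as pd

scores = pd.DataFrame({'probability': [0.05, 0.40, 0.62, 0.91]})
scores['risk'] = pd.cut(scores['probability'],
                        bins=[0, 0.33, 0.66, 1.0],
                        labels=['Low Risk', 'Moderate Risk', 'High Risk'])
print(scores)

The exact cut points are, of course, something to settle with users, which ties directly into the feedback loop discussed next.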

In many ways, optimal mappings will not be apparent until we have had the chance to obtain feedback from users. For business users, feedback can be explicit and carried out in ways that follow the principles of good design (simple, easy, and unobtrusive). In the rarer case where our users are actually the customers of our insights (say, a model that predicts someone’s likelihood of getting a job or their success in a relationship), feedback must also be intuitive and responsive (see below, where we expand on responsive feedback design through voice).

More psychology!!!

But it is not enough to simply borrow concepts from HCD to improve the success of data science products. Because these products are deployed to interact with people, both customers and business users alike, our success pipeline must be sensitive to the political and social psychological relationships that define how these individuals interact with each other and our products.

For example, machines that deliver automation or even augmentation to a business user can feel threatening. The threat can take the form of threats to job security, or to one’s feelings of efficacy and expertise. Thus, our data science pipeline must be sensitive to this outcome by directly addressing feelings of threat in order to achieve buy-in. Social psychologists have long recognized that to increase buy-in, people need to feel as though a new process is fair, and to ensure fairness the change process requires voice. Voice is the opportunity granted to users to partake in how the process actually unfolds. From a data science perspective, this means that we enable opportunities not just for feedback, as we learned from HCD, but to demonstrate how that feedback actually created change in our product.

For example, explain the key model features to users and solicit feedback for different ways to group those features into meaningful and actionable groupings. In one such instance, a client had the idea to group features that could be affected via different outreach mediums (e.g. personal phone call, email nudge, etc.). By incorporating this feedback into the product, users were already thinking about how to creatively develop content that could address these differences when they saw those with high probabilities (e.g. risk scores) along with key drivers that better matched different modes of outreach. Users saw the affordances because they were now an active participant in using the product to improve their own impact.

But voice is not the only perspective in psychology that can help to develop a successful data science product pipeline. Indeed, one could incorporate concepts from political psychology or the study of motivation to understand the relational aspects of a product’s success. We leave this to the imagination and creativity of you, the reader. Feel free to comment below on ideas to continue this conversation and push the envelope further in pursuing more effective models for data science success.

End-to-end success checklist

A useful checklist to consider in developing a successful data science pipeline might look something like this:

  • What characteristics make up the primary user groups for this product?
  • How do those characteristics suggest different possible affordances of my product? What does my product enable or prevent (anti-affordances) for those specific users?
  • What delivery or deployment method makes the most sense to achieve these affordances?
  • How do I signal these affordances to my user base?
  • What mappings make the most sense from my users’ perspective?
  • Am I providing opportunities for feedback that are simple, easy, and unobtrusive?
  • Can I demonstrate how the feedback has changed the product?

This concludes our post on successful data science product pipelines. We appreciate you taking the time to read this and look forward to seeing your continued ideas in the comments below. Although this post was high-level and rather theoretical, stay tuned as we will be including future topics that explore more practical issues in coding for data science and human decision making.

I would also emphasize that this is merely one application that attempts to merge different fields, but there are many other approaches. The key is to recognize the value of cross-pollination from fields as diverse as data science, data engineering, app development, user experience, and psychology. Cheers!

References

Norman, D. (2013). The design of everyday things: Revised and expanded edition. Constellation.

Tognazzini, B. (2014). First Principles of Interaction Design (Revised & Expanded).  AskTog.