Essential Statistics for Data Science: A Case Study using Python, Part I
System requirements:
Language: Python 3.5
Libraries: statsmodels, pandas, matplotlib
Data: available here
Our last post dove straight into linear regression. In this post, we'll take a step back to cover essential statistics that every data scientist should know. To demonstrate these essentials, we'll look at a hypothetical case study involving an administrator tasked with improving school performance in Tennessee.
Note, this tutorial is intended to serve solely as an educational tool and not as a scientific explanation of the causes of various school outcomes in Tennessee.
Meet Sally, a public school administrator. Some schools in her state of Tennessee are performing below average academically. Her superintendent, under pressure from frustrated parents and voters, approached Sally with the task of understanding why these schools are under-performing. Not an easy problem, to be sure.
To improve school performance, Sally needs to learn more about these schools and their students, just as a business needs to understand its own strengths and weaknesses and its customers.
Though Sally is eager to build an impressive explanatory model, she knows the importance of conducting preliminary research to prevent possible pitfalls or blind spots. Thus, she engages in a thorough exploratory analysis, which includes: a lit review, data collection, descriptive and inferential statistics, and data visualization.
Sally has strong opinions as to why some schools are under-performing, but opinions won't do, nor will a handful of facts; she needs rigorous statistical evidence.
Sally conducts a lit review, which involves reading a variety of credible sources to familiarize herself with the topic. Most importantly, Sally keeps an open mind and embraces a scientific world view to help her resist confirmation bias (seeking solely to confirm one's own world view).
In Sally's lit review, she finds multiple compelling explanations of school performance: curricula, income, and parental involvement. These sources will help Sally select her model and data, and will guide her interpretation of the results.
The data we want isn't always available (see here and here), but Sally lucks out and finds student performance data based on test scores for every public school in middle Tennessee. The data also includes various demographic, school faculty, and income variables (see readme for more information). Satisfied with this dataset, she writes a web-scraper to retrieve the data.
But data alone can't help Sally; she needs to convert the data into useful information.
Descriptive and Inferential Statistics
Sally opens her stats textbook and finds that there are two major types of statistics, descriptive and inferential.
Descriptive statistics summarize the data and identify patterns in it, but on their own they don't allow for testing hypotheses or drawing conclusions beyond the data at hand.
Within descriptive statistics, there are two measures used to describe the data: central tendency and deviation. Central tendency refers to the central position of the data (mean, median, mode) while the deviation describes how far spread out the data are from the mean. Deviation is most commonly measured with the standard deviation (see here for more on standard deviation). A small standard deviation indicates the data are close to the mean, while a large standard deviation indicates that the data are more spread out from the mean.
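As a minimal sketch of these measures, the snippet below computes the mean, median, mode, and standard deviation with Python's built-in statistics module. The values are made up for illustration and are not taken from Sally's dataset.

```python
import statistics

# Hypothetical percentages of students on reduced lunch at seven schools
# (made-up values for illustration only).
values = [20, 25, 25, 30, 40, 60, 80]

mean = statistics.mean(values)      # arithmetic average
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value
std = statistics.stdev(values)      # sample standard deviation (divides by n - 1)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("standard deviation:", round(std, 2))
```

Here the mean sits above the median because the two largest values pull it upward, which is one reason to look at more than one measure of central tendency.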
Inferential statistics allow us to make hypotheses (or inferences) about a sample that can be applied to the population. For Sally, this involves developing a hypothesis about her sample of middle Tennessee schools and applying it to her population of all schools in Tennessee.
For now, Sally puts aside inferential statistics and digs into descriptive statistics.
To begin learning about the sample, Sally uses pandas' describe() method, as seen below. The column headers in bold text represent the variables Sally will be exploring. Each row header represents a descriptive statistic about the corresponding column.
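A minimal sketch of this step, assuming the data sits in a pandas DataFrame. The column names school_rating and reduced_lunch are hypothetical stand-ins, and the values are toy data rather than Sally's actual dataset.

```python
import pandas as pd

# Made-up toy data; the column names are hypothetical stand-ins
# for the variables Sally explores.
schools = pd.DataFrame({
    "school_rating": [0, 0, 1, 1, 3, 3, 5, 5],
    "reduced_lunch": [88, 80, 75, 65, 50, 40, 30, 10],
})

# describe() returns one column per numeric variable and one row per
# statistic: count, mean, std, min, 25%, 50%, 75%, max.
summary = schools.describe()
print(summary)
```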
Looking at the output above, Sally's variables can be put into two classes: measurements and indicators.
Measurements are variables that can be quantified. All data in the output above are measurements. Some of these measurements, such as the school rating and test scores, are outcomes; these outcomes cannot be used to explain one another. For example, explaining the school rating as a result of test scores is circular logic. Therefore we need a second class of variables.
The second class, indicators, are used to explain our outcomes. Sally chooses indicators that describe the student body (for example, the percentage of students on reduced lunch) or the school administration, hoping they will explain the school rating.
Sally sees a pattern in one of the indicators: reduced lunch. This variable measures the average percentage of students per school enrolled in a federal program that provides lunches for students from lower-income households. In short, it is a good proxy for household income, which Sally remembers from her lit review was correlated with school performance.
Sally isolates the reduced-lunch variable, groups the data by school rating using pandas' groupby() method, and then uses describe() on the re-shaped data (see below).
Below is a discussion of the metrics from the table above and what each result indicates about the relationship between school rating and reduced lunch:
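The grouping step might look like the sketch below. It assumes the same hypothetical column names (school_rating, reduced_lunch) and made-up values, so the numbers it prints will not match Sally's table.

```python
import pandas as pd

# Made-up toy data with hypothetical column names.
schools = pd.DataFrame({
    "school_rating": [0, 0, 1, 1, 3, 3, 5, 5],
    "reduced_lunch": [88, 80, 75, 65, 50, 40, 30, 10],
})

# Group the schools by rating, keep only the reduced-lunch column,
# and summarize each group with the same descriptive statistics.
by_rating = schools.groupby("school_rating")["reduced_lunch"].describe()
print(by_rating)
```

Each row of the result is one school rating, and each column is one statistic, which makes it easy to read a trend down a single column such as the mean.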
count: the number of schools at each rating. Most of the schools in Sally's sample have a 4- or 5-star rating, but 25% of schools have a 1-star rating or below. This confirms that poor school performance isn't merely anecdotal, but a serious problem that deserves attention.
mean: the average percentage of students on reduced lunch among all schools at each school rating. As school performance increases, the average percentage of students on reduced lunch decreases. Schools with a 0-star rating have 83.6% of students on reduced lunch. And on the other end of the spectrum, 5-star schools on average have 21.6% of students on reduced lunch. We'll examine this pattern further in the graphing section.
std: the standard deviation of the variable. For schools with a rating of 0, a standard deviation of 8.813498 indicates that, if the data are roughly normally distributed, about 68.2% (refer to readme) of all observations are within 8.81 percentage points on either side of the average, 83.6%. Note that the standard deviation increases as school rating increases, indicating that reduced lunch loses explanatory power as school performance improves. As with the mean, we'll explore this idea further in the graphing section.
min: the minimum value of the variable. This represents the school with the lowest percentage of students on reduced lunch at each school rating. For 0- and 1-star schools, the minimum percentage of students on reduced lunch is 53%. The minimum for 5-star schools is 2%. The minimum value tells a similar story to the mean, but looks at it from the low end of the range of observations.
25%: the first (bottom) quartile; the value below which the lowest 25% of observations of the variable fall. For 0-star schools, 25% of the observations are less than 79.5%. Sally sees the same trend in the bottom quartile as in the above metrics: as school rating increases, the bottom quartile of reduced lunch decreases.
50%: the second quartile, or median; the value below which the lowest 50% of observations fall. The same relationship between school rating and reduced lunch is present here.
75%: the third (top) quartile; the value below which the lowest 75% of observations fall. The trend continues.
max: the maximum value for that variable. You guessed it: the trend continues!
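Two of the rows above can be checked with a short sketch. The first part uses the mean and standard deviation quoted for 0-star schools and assumes a normal distribution, which is an assumption and not something the data guarantees. The second part computes quartiles on made-up values.

```python
from statistics import NormalDist

import pandas as pd

# std row: share of a normal distribution within one standard deviation
# of the mean, using the 0-star figures quoted in the text.
mean, std = 83.6, 8.813498
dist = NormalDist(mu=mean, sigma=std)
one_sd_share = dist.cdf(mean + std) - dist.cdf(mean - std)
print(f"{mean - std:.1f}% to {mean + std:.1f}% covers {one_sd_share:.1%}")

# Quartile rows: made-up reduced-lunch percentages for seven schools.
lunch = pd.Series([53, 70, 79, 83, 86, 90, 95])
quartiles = lunch.quantile([0.25, 0.50, 0.75])  # 25%, 50% (median), 75%
print(quartiles)
```

The one-standard-deviation share comes out near 68% for any normal distribution; how well it describes real school data depends on how close that data is to normal.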
The descriptive statistics consistently reveal that schools with more students on reduced lunch under-perform when compared to their peers. Sally is on to something.
Sally decides to look at reduced lunch from another angle using a correlation matrix with pandas' corr() method. The values in the correlation matrix table will be between -1 and 1 (see below). A value of -1 indicates the strongest possible negative correlation, meaning as one variable decreases the other increases. And a value of 1 indicates the opposite. The result below, -0.815757, indicates strong negative correlation between school rating and reduced lunch. There's clearly a relationship between the two variables.
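A minimal sketch of a tabular correlation matrix, again with hypothetical column names and made-up values. The toy data is almost perfectly linear, so its coefficient is much closer to -1 than the -0.815757 reported for Sally's data.

```python
import pandas as pd

# Made-up toy data with hypothetical column names.
schools = pd.DataFrame({
    "school_rating": [0, 1, 2, 3, 4, 5],
    "reduced_lunch": [85, 75, 60, 50, 35, 20],
})

# corr() computes pairwise Pearson correlations by default.
corr = schools.corr()
r = corr.loc["school_rating", "reduced_lunch"]
print(corr)
print("rating vs. reduced lunch:", round(r, 3))
```

The diagonal is always 1 because every variable is perfectly correlated with itself; the off-diagonal cells carry the information.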
Sally continues to explore this relationship graphically.
Essential Graphs for Exploring Data
In her stats book, Sally sees a box-and-whisker plot. A box-and-whisker plot is helpful for visualizing how the data are distributed around the median: the box spans the middle 50% of the observations and the whiskers extend toward the extremes. Understanding the distribution allows Sally to see how spread out her data is; the larger the spread, the less robust reduced lunch is at explaining school rating.
See below for an explanation of the box-and-whisker plot.
Now that Sally knows how to read the box-and-whisker plot, she graphs the data to see the distributions. See below.
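One way such a plot could be drawn with matplotlib is sketched below: one box of reduced-lunch percentages per school rating. The column names, the values, and the output file name are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up toy data with hypothetical column names.
schools = pd.DataFrame({
    "school_rating": [0, 0, 0, 3, 3, 3, 5, 5, 5],
    "reduced_lunch": [90, 84, 78, 70, 50, 35, 45, 20, 5],
})

# Build one array of reduced-lunch values per rating.
ratings = sorted(schools["school_rating"].unique())
groups = [
    schools.loc[schools["school_rating"] == r, "reduced_lunch"].to_numpy()
    for r in ratings
]

fig, ax = plt.subplots()
boxes = ax.boxplot(groups)  # one box-and-whisker per rating
ax.set_xticklabels([str(r) for r in ratings])
ax.set_xlabel("school rating")
ax.set_ylabel("% of students on reduced lunch")
fig.savefig("reduced_lunch_by_rating.png")  # hypothetical output file name
```

In an interactive session, dropping the Agg line and calling plt.show() would display the figure instead of saving it.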
In her box-and-whisker plots, Sally sees that the minimum and maximum values tend to get closer to the mean as school rating decreases; that is, as school rating decreases so does the standard deviation in reduced lunch.
What does this mean?
Starting with the top box-and-whisker plot, as school rating decreases, reduced lunch becomes a more powerful way to explain outcomes. This could be because as parents' incomes decrease they have fewer resources to devote to their children's education (such as after-school programs, tutors, time spent on homework, computer camps, etc.) than higher-income parents. Above a 3-star rating, more predictors are needed to explain school rating due to an increasing spread in reduced lunch.
Having used box-and-whisker plots to reaffirm her idea that household income and school performance are related, Sally seeks further validation.
To further examine the relationship between school rating and reduced lunch, Sally graphs the two variables on a scatter plot. See below.
In the scatter plot above, each dot represents a school. The placement of the dot represents that school's rating (y-axis) and the percentage of its students on reduced lunch (x-axis).
The downward trend line shows the negative correlation between school rating and reduced lunch (as one increases, the other decreases). The slope of the trend line indicates how much school rating decreases as reduced lunch increases. A steeper slope would indicate that a small change in reduced lunch has a big impact on school rating, while a more horizontal slope would indicate that the same small change in reduced lunch has a smaller impact on school rating.
Sally notices that the scatter plot further supports what she saw with the box-and-whisker plot: when reduced lunch increases, school rating decreases. The tighter spread of the data as school rating declines indicates the increasing influence of reduced lunch. Now she has a hypothesis.
Sally is ready to test her hypothesis: a negative relationship exists between school rating and reduced lunch (to be covered in a follow-up article). If the test is successful, she'll need to build a more robust model using additional variables. If the test fails, she'll need to revisit her dataset to choose other variables that possibly explain school rating. Either way, Sally could benefit from an efficient way of assessing relationships among her variables.
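The trend line and its slope can be sketched with a least-squares fit. The example below uses numpy.polyfit, which is one common way to fit a straight line and not necessarily how the original chart was produced; the values and the output file name are made up.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up toy data: percentage on reduced lunch (x) and school rating (y).
reduced_lunch = np.array([85, 75, 60, 50, 35, 20], dtype=float)
school_rating = np.array([0, 1, 2, 3, 4, 5], dtype=float)

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(reduced_lunch, school_rating, 1)

fig, ax = plt.subplots()
ax.scatter(reduced_lunch, school_rating)  # one dot per school
xs = np.linspace(reduced_lunch.min(), reduced_lunch.max(), 50)
ax.plot(xs, slope * xs + intercept)       # the fitted trend line
ax.set_xlabel("% of students on reduced lunch")
ax.set_ylabel("school rating")
fig.savefig("rating_vs_reduced_lunch.png")  # hypothetical output file name

print("change in rating per percentage point of reduced lunch:", slope)
```

A negative slope corresponds to the downward trend line: each additional percentage point of students on reduced lunch is associated with a lower rating in this toy data.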
An efficient graph for assessing relationships is the correlation matrix, as seen below; its color-coded cells make it easier to interpret than the tabular correlation matrix above. Red cells indicate positive correlation; blue cells indicate negative correlation; white cells indicate no correlation. The darker the colors, the stronger the correlation (positive or negative) between those two variables.
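A color-coded correlation matrix along these lines could be sketched with matplotlib's imshow. The "bwr" colormap runs from blue through white to red, matching the color scheme described above. The column names (including the third variable, students_per_teacher), the values, and the output file name are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up toy data with hypothetical column names.
schools = pd.DataFrame({
    "school_rating": [0, 1, 2, 3, 4, 5],
    "reduced_lunch": [85, 75, 60, 50, 35, 20],
    "students_per_teacher": [18, 15, 17, 14, 16, 15],
})

corr = schools.corr()

fig, ax = plt.subplots()
# Pin the color scale to [-1, 1] so white always means zero correlation.
im = ax.imshow(corr.to_numpy(), cmap="bwr", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label="correlation")
fig.tight_layout()
fig.savefig("correlation_matrix.png")  # hypothetical output file name
```

Fixing vmin and vmax matters: without them the colormap would stretch to the observed range, and white would no longer line up with zero.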
With the correlation matrix in mind as a future starting point for finding additional variables, Sally moves on for now and prepares to test her hypothesis.
Sally was approached with a problem: why are some schools in middle Tennessee under-performing? To answer this question, she did the following:
- Conducted a lit review to educate herself on the topic.
- Gathered data from a reputable source to explore school ratings and characteristics of the student bodies and schools in middle Tennessee.
- Used descriptive statistics, which indicated a robust relationship between school rating and reduced lunch.
- Explored the data visually.
- Though satisfied with her preliminary findings, Sally is keeping her mind open to other explanations.
- Developed a hypothesis: a negative relationship exists between school rating and reduced lunch.
In a follow-up article, Sally will test her hypothesis. Should she find a satisfactory explanation for her sample of schools, she will attempt to apply her explanation to the population of schools in Tennessee.
Everyone loves a good story, especially when that story has a happy ending. What can make a great story even more compelling? When that story is true.
As marketers, we’ve been taught to eloquently boast about features, benefits and results. In some industries, the marketing rhetoric and jargon have become so commonplace that they often lose any power they may once have had, no matter how true the claims may be.
We so easily forget that some of the greatest marketing campaigns of all time never talked about performance or product; they talked about people. From the Marlboro Man to Dove’s Real Beauty campaign, they all tell a story.
Seth Godin said it best…
“Marketing is no longer about the stuff that you make, but about the stories you tell.”
B2B companies need to tell their stories. Even better is when they can let their customers do the storytelling for them. Enter stage left…the case study.
Elements of a Case Study
It’s been argued that most epic movies have the same plot. If you think of your favorite films and really analyze the storyline—from Finding Nemo to The Lord of the Rings—you’ll likely find that they all have many themes in common: great characters, conflict and some form of transformation. Building a case study won’t require a Hollywood studio, but it will require adopting some of the same principles. Let’s dive into the three main elements of a successful story to help you tell yours and develop a compelling case study.
The best storylines have characters that are highly relatable and have some endearing qualities that make the reader or viewer “feel” for the character. When it comes to crafting your case study, introduce your characters in a way that others can relate to. You may be tempted to position your company as the main character, the hero that swoops in to save the day. But customers should always be the main focus. Why? Because readers want to engage and empathize with a character they can relate to, and they need to be able to put themselves in that character's shoes.
When sharing about characters in your case studies, help readers understand who they are and what they do. What are the goals and aspirations for their company? While facts and statistics are needed to measure success, don’t forget to introduce the characters in a real and relatable way.
People don’t necessarily want to know what your product or service can do; they want to know what it can do for them. Epic stories have some form of adversity — some problem needs to be overcome. So spell it out. Does a manufacturer have inefficient processes? Is an insurance company trying to compete in a high-tech market with ancient legacy systems? Is a service company losing money because they aren’t attracting sales leads?
Conflict also conjures up emotion. That’s what truly connects us with a story, isn’t it? Don’t be afraid to include information in your case study about the frustrations, fears and anxiety that your customers experienced before you came on the scene. People can relate to what it feels like to stay awake at night worrying about a problem.
In addition, true conflict has high stakes. In the cinematic world, the stakes usually involve the risk of losing something…a kingdom, a relationship, a life. In business, it might involve losing money, customers or opportunities for growth. At first glance, the storyline may not appear as though it will keep people on the edge of their seats, but for the business owner who is wracking his or her brain trying to figure out how to streamline processes to turn a profit and stay afloat, the struggle is real and the stakes are high. The characters in your case study need to have something to lose.
Every great story has a hero. As mentioned previously, many companies want to position themselves as the hero, but are they, really? When looking at great story plots, it’s often the character with the most to lose who ends up being the hero because he or she rose to the challenge, figured out the solution and transformed a situation into a happy ending.
Position your customer as the hero—the character who figured out that he or she had a problem and had the know-how and courage to tackle it head on. In their storyline, they discovered a solution (that would be you), and saved the day.
This is the fun part of the story where you get to share the results of your hero’s good decision to involve your company and how everything fell into place. Provide measurable results, before and after statistics, sales figures, and/or cost savings and other meaningful data. Be specific and give your audience an idea of the results they can potentially experience if they choose to become the hero and invite you into their own story.
Beyond the quantifiable results, talk about those that involve emotion. Talk about how employee morale has skyrocketed, how customer service has noted a marked decrease in the number of complaints they receive, and how relieved the characters in your story feel now that their problem is solved. Use your customer’s own words to tell how thrilled they are with the transformation by including a quote and, when appropriate, include a photo, too.
While writing a case study for your company may not result in a box office hit, using a storytelling approach and these techniques can help you draw readers in and compel them to involve you in their story—their own happy ending. Before you know it, you’ll be writing a sequel that tells of another great success story…and another.