"Think Like a Data Scientist": Chapter 2 Summary

Chapter 2: Setting goals by asking good questions

The second chapter of Think Like a Data Scientist is all about asking good questions. Good questions clarify the subtleties of a data science project and help identify pitfalls and roadblocks before you run into them. The answers let data scientists plan around those roadblocks and give them more than one plan of attack against the problem. In my mind, it sounds like a game of chess or a big branching decision tree filled with “If this, then that, but if instead this, then instead THAT!”. That’s my attempt at simplifying the concepts, at least.

There are two main sources that must be questioned: the customer and the data itself. The customer is the individual or organization paying you to do the data science project. Customers could be businesses, academics, government agencies, or individuals within those organizations. Whoever they are, it is imperative to ask them good questions because they have the domain-specific knowledge. Perhaps you’re a data scientist working in finance, and your customer is an investment banker trying to use data to make good investments, or perhaps your customer is a botanist asking you to design a plant-identifying algorithm. Who knows? Either way, there is a good chance that you, as the data scientist, will not have the domain-specific knowledge the problem requires. That makes it vital to ask the customer good questions, such as what they expect to see in the final product or report, or which pieces of data they think will be the most valuable to examine.

Interrogating data is also part of the job. Before getting started on a data science project, it is incredibly important to know exactly what your data consists of. This means knowing what format it is in, what information it contains, and, importantly, whether it can answer the question you’re being asked to answer. It’s also worth highlighting that good questions are concrete in their assumptions. Concrete here means the assumptions are both well-defined and testable. Assumptions that are ill-defined and non-falsifiable are often a recipe for disaster. Generally speaking, some of the questions you might ask as a data scientist concern the statistical properties of the dataset (Can I assume normality? Are the data points independent of each other? And so on.).
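To make "interrogating the data" a bit more tangible, here is a minimal sketch of a first pass over a dataset using only Python's standard library. Everything in it is an assumption for illustration and not from the book: the sample CSV, the column names, and the helper functions `profile_columns` and `missing_columns` are all hypothetical. It only checks two things mentioned above: what the data contains, and whether the columns a question depends on are actually there.

```python
import csv
import io

# Hypothetical sample data (not from the book): a tiny customer table
# with a couple of deliberately empty cells.
SAMPLE_CSV = """customer_id,signup_date,monthly_spend
1,2021-01-04,20.5
2,2021-01-09,
3,,35.0
"""


def profile_columns(csv_text):
    """Return, per column, the number of rows and the number of empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = {}
    for column in reader.fieldnames or []:
        values = [row[column] for row in rows]
        # Treat absent or whitespace-only cells as missing.
        missing = sum(1 for v in values if v is None or v.strip() == "")
        summary[column] = {"rows": len(rows), "missing": missing}
    return summary


def missing_columns(required, summary):
    """List the required columns that the dataset does not contain at all."""
    return [name for name in required if name not in summary]


summary = profile_columns(SAMPLE_CSV)

# A question about spend per region needs a "region" column,
# which this hypothetical dataset does not have.
unavailable = missing_columns(["monthly_spend", "region"], summary)

for column, stats in summary.items():
    print(column, stats)
print("Columns the question needs but the data lacks:", unavailable)
```

A check like this will not say whether an assumption such as normality or independence holds, but it does turn "can the data answer the question?" into something testable before any modeling starts.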

Lastly, a data science project should have a set of goals in mind from the start. The three key questions to ask when coming up with goals are:

  1. What’s possible?

  2. What’s valuable?

  3. What’s efficient?

As discussed in the Chapter 1 summary, uncertainty is the name of the game when it comes to data science. It may not be clear at the start whether a goal is possible without first asking and answering some good questions. In terms of value, it is worth discussing with the customer what you think would add value to the final deliverable. For example, if the product is a web application for customers, what would it look like, what features would it provide, and what would customers find helpful? Finally, efficiency is defined by the following equation:

$$ \text{Efficiency} = \frac{\text{Value}}{\text{Effort}} \times \text{Possibility} $$

In other words, the overall efficiency of pursuing a goal is the ratio of the value it delivers to the effort it requires, multiplied by the probability of achieving it. All three components play a role in the efficiency of a project, and data scientists often need to decide whether the value a goal delivers is worth the effort, or whether the goal is even possible in the first place.
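As a rough illustration of how the efficiency equation could be used to compare candidate goals, here is a small Python sketch. The goals and their numbers are made up for this post, and the `Goal` class and `efficiency` function are hypothetical names, not anything defined in the book. Value and effort are in arbitrary units, and possibility is treated as a probability between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    name: str
    value: float        # estimated value delivered (arbitrary units)
    effort: float       # estimated effort required (arbitrary units)
    possibility: float  # estimated probability of achieving the goal, 0 to 1


def efficiency(goal: Goal) -> float:
    """Efficiency = (Value / Effort) * Possibility."""
    if goal.effort <= 0:
        raise ValueError("effort must be positive")
    if not 0 <= goal.possibility <= 1:
        raise ValueError("possibility must be between 0 and 1")
    return goal.value / goal.effort * goal.possibility


# Made-up example goals, purely for illustration.
goals = [
    Goal("Forecast weekly sales", value=8, effort=4, possibility=0.5),
    Goal("Dashboard of existing metrics", value=3, effort=1, possibility=0.9),
    Goal("Real-time recommendations", value=10, effort=10, possibility=0.25),
]

# Rank goals from most to least efficient.
ranked = sorted(goals, key=efficiency, reverse=True)

for goal in ranked:
    print(f"{goal.name}: {efficiency(goal):.2f}")
```

With these invented estimates, the modest dashboard ranks first because it is cheap and very likely to succeed, while the most valuable goal ranks last because it is both expensive and unlikely. The estimates themselves are guesses, so a ranking like this is a conversation starter with the customer, not a verdict.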

That’s it for my summary of Chapter 2 of Think Like a Data Scientist by Dr. Brian Godsey! I’ll be reading Chapter 3 soon and will write a blog post about it afterwards!