Kyle Kabasares 7/14/23

The Job Hunt Begins + "Think Like a Data Scientist": Chapter 1 Summary

Preface

As I enter the “post-PhD phase” of my life, as I like to call it, I have been on the hunt for my next job (Bye, Academia!). Now, I know it’s a joke at this point that STEM PhDs (particularly Physics PhDs like myself) who don’t go into academia try to jump ship into Data Science, but who can blame us? It’s new and exciting, the U.S. Bureau of Labor Statistics predicted a 36% growth in the number of jobs between 2021 and 2031, it pays extremely well (U.S. median salary of $101,000 in 2021, also according to the U.S. Bureau of Labor Statistics), and a Harvard Business Review article from 2012 says it’s the “Sexiest Job of the 21st Century”. I will admit, though, I am a bit sad to relinquish my “Astrophysicist” job title so soon after getting my PhD, but alas…

To help me with my attempted transition into Data Science, I started reading the book “Think Like a Data Scientist” by Dr. Brian Godsey. Now, reading is one thing, but understanding and communicating what you read is another. To help with the latter two, I thought it would be a fun little side project to summarize the book in my own words, chapter by chapter, so that I retain what I read. So without further ado, let’s begin the summary!

Chapter 1: Philosophies of Data Science

The opening chapter of the book introduces some key philosophies for working as a data scientist, as well as the roles data scientists play within a company or organization. Dr. Godsey states that the greatest asset of a data scientist is awareness. This is particularly important when solving problems that involve many uncertainties. Data scientists need to be able to foresee potential problems before they occur by considering multiple solutions to a problem and assessing the relative strengths and weaknesses of each approach. Knowing where potential roadblocks will occur is especially important when planning for the challenges a data science problem will bring.

A section of this chapter covers the differences between a software developer and a data scientist. While there are commonalities between the two positions, they play different roles in an organization. As the title implies, software developers develop software (Wow, who would have guessed?) and systems that typically have well-defined components, whereas data scientists typically have to work with systems or processes that aren’t necessarily well-defined (which is why they have to be good at dealing with uncertainty). Software developers typically have specific programs, code libraries, and proper documentation for most of the tools they use, whereas data scientists sometimes have to design their own tools from scratch without clear guidance. Thankfully, with the advent of open-source software, many data scientists share the tools they build on platforms like GitHub.

The priorities of a data scientist are as follows:

Knowledge: Know the data and the problem you’re trying to solve as well as possible.
Technology: Software is a tool. Don’t let it dictate the solution to the problem unless absolutely necessary.
Opinions: Intuition and wishful thinking should only be used as guides to theories that can be falsified and tested, and should not be the sole focus of the project.

Dr. Godsey recommends that IF you were to violate this hierarchy, you should only do so knowingly and for good reason (such as when you’re racing against a time or resource constraint and need to use the available technology to solve the problem instead of building a new tool).

DOCUMENT your code! Don’t be like Kyle, who writes code and forgets how it works several months later because he didn’t properly comment it. Also, unlike Kyle, adhere to good coding standards and practices such as PEP 8 for Python. To be fair, Kyle (that is, I) didn’t know about any of these things when learning how to program in Python, so the bad habits were implanted from the get-go, but I promise that I will be better!
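To make the documentation point a bit more concrete, here is a minimal sketch of what a documented, PEP 8-style Python function might look like. The function name normalize and what it does are made up purely for illustration; this example is my own and does not come from the book.

```python
"""Toy example of a documented function (hypothetical, not from the book)."""


def normalize(values):
    """Rescale a list of numbers to the range [0, 1].

    Parameters
    ----------
    values : list of float
        Raw measurements. Must contain at least two distinct numbers.

    Returns
    -------
    list of float
        The rescaled values, in the same order as the input.

    Raises
    ------
    ValueError
        If every value is identical, since the range would be zero.
    """
    lowest = min(values)
    highest = max(values)
    if lowest == highest:
        raise ValueError("values must contain at least two distinct numbers")
    # Shift so the minimum becomes 0, then divide by the range so the
    # maximum becomes 1.
    return [(value - lowest) / (highest - lowest) for value in values]


if __name__ == "__main__":
    print(normalize([2, 4, 6]))
```

The docstring says what goes in, what comes out, and what can go wrong, while the inline comment explains the one step that might not be obvious months later. Descriptive snake_case names and four-space indentation follow PEP 8.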

Use Version Control! It is incredibly important in large data science projects, especially when working within a team. Learn how to use tools such as Git and GitHub to keep a record of all the changes you’ve made to your code.

If you’ve read this far, thanks for reading this summary of Chapter 1! Stay tuned for more summaries to come!