Over the last decade, data science has been steadily growing in popularity. Nowadays, it’s helping to transform fields like economics, healthcare, transportation, and many more. Naturally, the professionals who make these changes possible, data scientists, are also increasingly in demand. Many aspiring data scientists, however, are unsure which skills they need to hone.
Although numerous authors come up with their own lists of data scientist skills, they prefer to focus on concrete hard skills: “You’ll need to know this technology, this program, and this technique.” In this article, we’re taking a broader approach: we’re covering skills that transcend the latest machine learning frameworks and that will help you become a better data scientist. Our aim is to give you a definitive answer to the following question: “What skills does a data scientist need?”
Getting the Definition Right
The data science industry often appears mysterious to outsiders: data scientists are portrayed as sages dabbling in highly advanced technologies like artificial intelligence — to some people, this could very well be magic. But what do they actually do — and how?
If we want to explore data scientists’ skills, we need to analyze the very definition of this profession first:
A person employed to analyze and interpret complex digital data, such as the usage statistics of a website, especially in order to assist a business in its decision-making,
as the Oxford Dictionary puts it. However, we can also take a look at the very essence of this job:
- Their aim is to provide insight (typically, to a business).
- Their tools are programming (general-purpose languages and database query languages, for instance), mathematics, and statistics.
- They achieve their goal by working with data and noticing correlations and trends.
Skill 1: Working with Data
In the last few years, various media publications have been hyping the data science industry.
This led many aspiring data scientists to believe that their day-to-day work routine would be full of excitement and novelty. Which expectations do they usually have?
- “I’ll work with enormous volumes of data and help the company make millions of dollars!”
- “I’ll implement state-of-the-art deep learning techniques and help the scientific community progress even further!”
- “I’ll mine massive datasets and train machine learning models on them!”
Although these goals are certainly achievable, we need to stress that the term “data science” is part science, but also part data — and the “data” part is often disregarded. In the end, working with data isn’t the most glamorous process because it involves many lower-level activities like cleaning, shaping, or organizing data — and it’s tempting to brush them off as trivial or unimportant. However, we shouldn’t underestimate their importance: these processes are vital for a well-organized data science workflow.
This sentiment is echoed in an article titled "Data science is different now" by data science expert Vicki Boykis:
The reality is that “data science” has never been as much about machine learning as it has about cleaning, shaping data, and moving it from place to place…. While tuning models, visualization, and analysis make up some component of your time as a data scientist, data science is and has always been primarily about getting clean data in a single place to be used for interpolation.
To do this, you’re likely to use SQL — a language designed to manage data in relational databases. Unlike the <latest and the coolest machine learning framework>, SQL is a technology that’s here to stay — the concept of relational databases has proven to be instrumental to programming, so learning SQL is a great investment.
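As a rough illustration of the cleaning and shaping work described above, here is a minimal sketch that runs SQL through Python's built-in sqlite3 module. The table name `page_visits`, its columns, and the sample rows are hypothetical and exist only for this example; the query trims whitespace, normalizes case, skips missing values, and aggregates the result.

```python
import sqlite3


def clean_visit_counts(rows):
    """Load raw (page, visits) rows into an in-memory database and
    return cleaned per-page totals, computed entirely in SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE page_visits (page TEXT, visits INTEGER)")
        conn.executemany("INSERT INTO page_visits VALUES (?, ?)", rows)
        query = """
            SELECT LOWER(TRIM(page)) AS page, SUM(visits) AS total_visits
            FROM page_visits
            WHERE page IS NOT NULL
              AND TRIM(page) <> ''
              AND visits IS NOT NULL
            GROUP BY LOWER(TRIM(page))
            ORDER BY total_visits DESC, page
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Hypothetical, deliberately messy sample data.
raw_rows = [
    ("/home", 120),
    (" /Home ", 30),     # same page, inconsistent spacing and case
    ("/pricing", 45),
    (None, 10),          # missing page name
    ("/pricing", None),  # missing visit count
    ("", 5),             # empty page name
]

cleaned = clean_visit_counts(raw_rows)
print(cleaned)
```

Most of the query is about removing noise rather than analysis, which mirrors the point made in the quote above.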
How can you acquire this skill? A quick Udemy/Coursera search with the term “SQL” will net you some great courses. The harder part, however, is accepting the importance of SQL: it is by no means a “sexy” or “cool” technology, but it lays the foundation of your skill set as a data scientist.
Skill 2: Readiness to Adapt and Learn
Naturally, this skill can prove useful for any professional, but it rings especially true in the IT industry, where the pace of change is incredibly high. Although the fundamentals of working with data stay largely the same (the lack of dramatic changes in SQL is a good example), high-end data science does benefit from new technologies.
This skill, however, has an alter ego: the ability not to follow trends blindly. When you see a new technology enter the market, it’s tempting to burn your old infrastructure down and rebuild it with this new tech in mind. In data science, it’s not always clear if the ideas in a particular whitepaper will turn out revolutionary or irrelevant, so novel ideas are best approached critically.
How can you acquire this skill? A good starting point is, well, learning something new! No matter the degree of your proficiency in data science, there should be some areas that you can improve on. One approach is making your knowledge deeper: learning a new framework or mastering that one algorithm. Another approach is making your knowledge wider: improving your communication and presentation skills, learning more about visual storytelling (e.g. how to create data visualizations the right way), and so on.
Learning something new comes with a caveat: sometimes, the learning process can only be done outside of work, which means, in a sense, working even after you finish your work day. Many data scientists (and IT professionals in general) disapprove of this practice. This can be a heated debate, but one thing stays constant: adapting and learning can really help your marketability in the long run.
Then, it’s only a matter of keeping up with the latest trends and scientific findings. DeepAI offers a great newsletter titled “This week in AI” which provides, as the name suggests, weekly recaps of research, job postings, and data science projects.
Skill 3: Mastery of a Programming Language
Your programming language of choice will be your most important tool: it determines what you can do and how you can do it. On the topic of programming languages, let’s also address the “Python vs. R” debate (both languages are used in the data science field): Python’s immense popularity can lead you to believe that R has been rendered obsolete. Although Python does dominate the data science industry, R still has its uses in the following areas:
- Data analysis,
- Data visualization (with packages like ggplot2 and ggedit),
- Quantitative finance,
- And more.
The question, therefore, shouldn’t be “Is A better than B?”, but rather “Do I need A or B for this task?”. When it comes to Python, it’s a natural choice for most data science-related activities: Python offers a rich set of libraries for mathematics, machine learning, and automation.
There is, however, a fine line between “knowing a programming language” and “knowing a programming language well”. The difference between these two categories may be hard to spot in some domains, but in data science, it is critical to write optimized code: a suboptimal choice of algorithms, for instance, can cost you a lot of compute resources.
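To make the cost of a suboptimal choice concrete, here is a small sketch comparing two ways of answering the same question: how many query IDs appear in a collection of known IDs. The function names and the data are made up for illustration. The list version scans the list on every lookup, while the set version uses hash lookups that take constant time on average.

```python
import timeit


def count_known_ids_list(known_ids, queries):
    """Membership test against a list: each lookup scans the list, O(n)."""
    return sum(1 for q in queries if q in known_ids)


def count_known_ids_set(known_ids, queries):
    """Membership test against a set: each lookup is O(1) on average."""
    known = set(known_ids)
    return sum(1 for q in queries if q in known)


# Hypothetical data: 2,000 known IDs, 2,000 queries, half of which match.
known_ids = list(range(2_000))
queries = list(range(1_000, 3_000))

list_result = count_known_ids_list(known_ids, queries)
set_result = count_known_ids_set(known_ids, queries)

# Both approaches return the same answer; only the running time differs.
list_time = timeit.timeit(lambda: count_known_ids_list(known_ids, queries), number=3)
set_time = timeit.timeit(lambda: count_known_ids_set(known_ids, queries), number=3)
print(f"matches: {list_result}, list: {list_time:.4f}s, set: {set_time:.4f}s")
```

Exact timings depend on the machine, but the gap widens quickly as the collections grow, which is where the compute cost mentioned above comes from.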
It’s natural for data scientists to hail their favorite tools — Pythonistas, for instance, are always eager to talk about Python. To become a better data scientist, however, you need to adopt a more critical approach. This skill consists of two parts:
- A deep understanding of the language’s inner workings: how to optimize it and what its strengths are.
- An even deeper understanding of the language’s shortcomings: what its limitations are and, if it comes to that, which other technologies can be used instead.
How can you acquire this skill? Together with hands-on experience, books like Fluent Python and High Performance Python can help you obtain a better understanding of Python. The trick, of course, is combining reading with implementing what you’ve read.
Skill 4: Avoiding Bias in Data
Although not a hard skill in and of itself, an ethical approach to data science (i.e. staying wary of potential bias in data) is important. When working with large volumes of information, the data scientist’s perception may become skewed. To counter this, we need to remember a simple formula: “correlation != causation”. Data scientists often encounter certain trends in the data they work with, and it’s tempting to reach the easiest (i.e. the most obvious) conclusion:
“The company’s revenue has been on a sharp decline since November 28th… November 28th was when we allowed some of our employees to work remotely!”
For our imaginary company above, remote work may very well have caused a revenue drop; maybe it disrupted the company’s internal processes. The keyword here is “maybe”: maybe it did, maybe some other policies were at fault. This (somewhat) humorous example will probably make you say “Nonsense! It’s obvious that there is no connection!”, but we have another truism to counter your skepticism:
“Obvious is only obvious in retrospect.”
Upon making a silly mistake, we think that it would have been easy to avoid. In the moment, however, we aren’t that far-sighted. Here’s an important takeaway: data often blinds us and makes us biased, so a critical approach is ever so useful.
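The “correlation != causation” formula can be illustrated with a short simulation. This is a sketch on synthetic data, not a real analysis: a hidden factor drives two series, `a` and `b`, that never influence each other, yet they end up strongly correlated. All names and numbers are assumptions made for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 5_000

# A hidden factor (think of a market-wide trend) that we do not observe directly.
hidden = [rng.gauss(0, 1) for _ in range(n)]

# Two series that both depend on the hidden factor but not on each other.
a = [h + rng.gauss(0, 0.5) for h in hidden]
b = [h + rng.gauss(0, 0.5) for h in hidden]

raw_corr = pearson(a, b)

# Removing the hidden factor's contribution leaves only independent noise.
# Plain subtraction works here only because the data was constructed this way.
adjusted_corr = pearson(
    [x - h for x, h in zip(a, hidden)],
    [y - h for y, h in zip(b, hidden)],
)

print(f"raw correlation: {raw_corr:.2f}, after removing hidden factor: {adjusted_corr:.2f}")
```

With real data, the hidden factor is rarely known in advance, which is exactly why an obvious-looking link deserves a second look.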
For a more realistic example, let’s imagine that we have a website called FruitTube — a platform where users can, well, upload video reviews of various fruits. After some time, we notice that the “Orange” category is by far the most popular one… but why?
- Maybe oranges are indeed people’s favorite fruit.
- Maybe the videos themselves are of higher production quality…
- … more amusing…
- … or more interesting in this category, so users find them more enjoyable to watch.
- Maybe people who prefer other fruits are a different demographic and they’re not sufficiently represented on our website.
- Maybe an error in the codebase prevents users from pressing the “Like” button in other categories.
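Some of these hypotheses can be checked with very little code. The sketch below uses made-up FruitTube numbers (the categories, view counts, like counts, and the `min_views` threshold are all hypothetical) to compute a like rate per category and to flag categories that get plenty of views but no likes at all, which would hint at a broken “Like” button rather than at users' taste.

```python
def summarize_categories(stats, min_views=1000):
    """Return the like rate per category and flag categories that have
    many views but zero likes (a possible sign of a technical problem)."""
    report = {}
    for category, (views, likes) in stats.items():
        like_rate = likes / views if views else 0.0
        suspicious = views >= min_views and likes == 0
        report[category] = {"like_rate": like_rate, "suspicious": suspicious}
    return report


# Hypothetical (views, likes) per category.
stats = {
    "Orange": (50_000, 4_000),
    "Apple": (40_000, 0),   # lots of views, not a single like
    "Banana": (500, 45),    # small audience, but an engaged one
}

report = summarize_categories(stats)
for category, row in report.items():
    print(category, f"{row['like_rate']:.2%}", "CHECK" if row["suspicious"] else "ok")
```

A flag like this does not prove that a bug exists. It only tells us which explanation to rule out before concluding that oranges are simply everyone's favorite.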
In this article, we’ve examined four skills that can help you become a better data scientist:
- Become proficient with SQL to work with data. SQL is a fundamental technology.
- Get ready to adapt and learn. This can help you preserve your proficiency and marketability.
- Master a programming language. Maximize the effectiveness of your programs.
- Stay aware of potential bias. Don't let data cloud your judgement.