*Data science is a term that escapes any single complete definition, which makes it difficult to use correctly. Most articles and publications use the term freely, on the assumption that it is universally understood. However, data science, along with its methods, goals, and applications, evolves with time and technology. Data science 25 years ago referred to gathering and cleaning datasets and then applying statistical methods to that data. In 2018, data science has grown into a field that encompasses data analysis, predictive analytics, data mining, business intelligence, machine learning, and much more. In fact, because no one definition fits the bill seamlessly, it is up to those who do data science to define it. Recognising the need for a clear-cut explanation of data science, the 365 Data Science Team designed the What-Where-Who infographic. We define the key processes in data science and break the field down. Here is our interpretation of data science.*

**Data science: a universally recognizable term that is in desperate need of clarification.**

## Data science, ‘explained in under a minute’, looks like this.

You have data. To use this data to inform your decision-making, it needs to be relevant, well-organised, and preferably digital. Once your data is coherent, you proceed with analysing it, creating dashboards and reports to understand your business’s performance better. Then you set your sights on the future and start generating predictive analytics. With predictive analytics, you assess potential future scenarios and predict consumer behaviour in creative ways.

*Author’s note: You can learn more about how data science and business interact in our article 5 Business Basics for Data Scientists.*

But let’s start at the beginning.

**The Data in Data Science**

Before anything else, there is always data. Data is the foundation of data science; it is the material on which all the analyses are based. In the context of data science, there are two types of data: traditional data and big data.

Traditional data is structured data stored in databases that analysts can manage from one computer; it is in table format and contains numeric or text values. Actually, the term “traditional” is something we are introducing for clarity. It helps emphasize the distinction between big data and other types of data.

Big data, on the other hand, is… bigger than traditional data, and not in the trivial sense. It differs in variety (numbers and text, but also images, audio, mobile data, etc.), velocity (it is retrieved and computed in real time), and volume (measured in tera-, peta-, and exabytes), and it is usually distributed across a network of computers.

That said, let’s define the What, the Where, and the Who that characterize each type of data in data science.

**What do you do to Data in Data Science?**

**Traditional data in Data Science**

Traditional data is stored in relational database management systems.
- Collect raw data and store it on a server

- Class-label the observations

- Data cleansing/data scrubbing

- Data balancing

**Data balancing** methods, such as extracting an equal number of observations for each category and preparing that sample for processing, fix the issue of categories that are unevenly represented in the data.

- Data shuffling
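The balancing and shuffling steps in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `records` data, the labels, and the function names `balance` and `shuffle_observations` are hypothetical, and balancing is done by downsampling every category to the size of the smallest one, which is one of several possible approaches.

```python
import random
from collections import defaultdict


def balance(observations, label_of, seed=0):
    """Keep an equal number of observations for each category (downsampling)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for obs in observations:
        by_label[label_of(obs)].append(obs)
    # Every category is cut down to the size of the smallest category.
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


def shuffle_observations(observations, seed=0):
    """Return a shuffled copy so the original order cannot influence later steps."""
    rng = random.Random(seed)
    shuffled = list(observations)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical class-labelled records: (customer id, label).
records = [
    ("a1", "churned"), ("a2", "stayed"), ("a3", "stayed"),
    ("a4", "stayed"), ("a5", "churned"), ("a6", "stayed"),
]
balanced = balance(records, label_of=lambda record: record[1])
prepared = shuffle_observations(balanced)
```

Downsampling discards data, so with very small categories other balancing strategies may be preferable; the fixed seeds are there only to make the sketch reproducible.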

**Big Data in Data Science**

When it comes to big data and data science, there is some overlap with the approaches used in traditional data handling, but there are also a lot of differences.

First of all, big data is stored on many servers and is far more complex.

- Collect the data
- Class-label the data

- Data cleansing

- Data masking

**Where does data come from?**

Traditional data may come from basic customer records or historical stock price information.

Big data, however, is all around us. A steadily growing number of companies and industries use and generate big data. Consider online communities, for example Facebook, Google, and LinkedIn, or financial trading data. Temperature-measuring grids in various geographical locations also produce big data, as do sensors in industrial equipment and, of course, wearable tech.

**Who handles the data?**

The data specialists who deal with raw data and pre-processing, and with creating and maintaining databases, go by different names. But although their titles sound similar, there are palpable differences in the roles they occupy. Consider the following.

**Data Architects** and **Data Engineers** (and Big Data Architects and Big Data Engineers, respectively) are crucial in the data science market.

The former create the database from scratch; they design the way data will be retrieved, processed, and consumed. The data engineer, in turn, uses the data architect’s work as a stepping stone and processes (pre-processes) the available data. Data engineers are the people who ensure the data is clean, organized, and ready for the analysts to take over.

The **Database Administrator**, on the other hand, is the person who controls the flow of data into and out of the database. Of course, with big data almost the entirety of this process is automated, so there is no real need for a human administrator. The Database Administrator deals mostly with traditional data.

That said, once data processing is done and the databases are clean and organised, the real data science begins.

**Data Science**

There are also two ways of looking at data: with the intent to explain behaviour that has already occurred and for which you have gathered data, or with the intent to use the data you already have to predict future behaviour.

**Data Science explaining the past**

**Business Intelligence**

Before data science jumps into predictive analytics, it must look at the patterns of behaviour the past provides and analyse them to draw insight and inform the path for forecasting. Business intelligence focuses precisely on this, providing data-driven answers to questions like: *How many units were sold? In which region were the most goods sold? Which type of goods sold where?*

*How did the email marketing perform last quarter in terms of click-through rates and revenue generated? How does that compare to the performance in the same quarter of last year?*

Although Business Intelligence does not have “data science” in its title, it is part of data science, and not in any trivial sense.

**What does Business Intelligence do?**

Of course, Business Intelligence Analysts can apply data science to measure business performance. But to achieve that, they must employ specific data handling techniques.

The starting point of all data science is data. Once the relevant data is in the hands of the BI Analyst (monthly revenue, customer data, sales volume, etc.), they must quantify the observations, calculate KPIs, and examine measures to extract insights from their data.
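As a rough illustration of what calculating KPIs can look like, the sketch below computes an email click-through rate and a year-over-year change, echoing the email marketing questions earlier in this section. All figures, dictionary keys, and function names are hypothetical and chosen only for the example.

```python
def click_through_rate(clicks, emails_delivered):
    """Share of delivered emails that received a click."""
    if emails_delivered == 0:
        return 0.0
    return clicks / emails_delivered


def percent_change(current, previous):
    """Relative change from the previous value to the current one, in percent."""
    return (current - previous) / previous * 100


# Hypothetical campaign figures for two comparable quarters.
last_quarter = {"delivered": 20_000, "clicks": 900, "revenue": 45_000.0}
same_quarter_last_year = {"delivered": 16_000, "clicks": 640, "revenue": 36_000.0}

ctr_now = click_through_rate(last_quarter["clicks"], last_quarter["delivered"])
ctr_then = click_through_rate(
    same_quarter_last_year["clicks"], same_quarter_last_year["delivered"]
)

ctr_change = percent_change(ctr_now, ctr_then)
revenue_change = percent_change(
    last_quarter["revenue"], same_quarter_last_year["revenue"]
)
```

In practice these numbers would come from a database query or a BI tool rather than hard-coded dictionaries; the point is only that a KPI is a small, well-defined calculation on top of clean data.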
**Data Science is about telling a story**

Apart from handling strictly numerical information, data science, and specifically business intelligence, is about visualizing the findings and creating easily digestible images supported only by the most relevant numbers. After all, all levels of management should be able to understand the insights from the data and use them to inform their decision-making.

**Where is business intelligence used?**

**Price optimisation and data science**

Notably, analysts apply data science to inform things like price optimisation techniques. They extract the relevant information in real time, compare it with historical data, and take action accordingly. Consider how hotels are managed: managers raise room prices during periods when many people want to visit the hotel and reduce them to attract visitors in periods of low demand.

**Inventory management and data science**

Data science and business intelligence are invaluable for handling over- and undersupply. In-depth analyses of past sales transactions identify seasonality patterns and the times of the year with the highest sales, which leads to effective inventory management techniques that meet demand at minimum cost.

**Who does the BI branch of data science?**

A BI analyst focuses primarily on analysing and reporting on historical data.

The BI consultant is often just an ‘external BI analyst’. Many companies outsource their data science departments because they don’t need or want to maintain one. BI consultants would be BI analysts had they been employed in-house; however, their job is more varied as they hop on and off different projects. The dynamic nature of the role gives the BI consultant a different perspective: whereas the BI analyst has highly specialized knowledge (i.e., depth), the BI consultant contributes to the breadth of data science.

The BI developer is the person who handles more advanced programming tools, such as Python and SQL, to create analyses specifically designed for the company. It is the third most frequently encountered position in a BI team.

**Data Science predicting the future**

Predictive analytics in data science rests on the shoulders of explanatory data analysis, which is precisely what we have been discussing up to this point. Once the BI reports and dashboards have been prepared and insights have been extracted from them, this information becomes the basis for predicting future values. And the accuracy of these predictions depends on the methods used.

**Recall the distinction between traditional data and big data in data science.**

We can make a similar distinction regarding predictive analytics and their methods: traditional data science methods vs. Machine Learning. One deals primarily with traditional data, and the other – with big data.
**Traditional forecasting methods in Data Science: What are they?**

Traditional forecasting methods comprise the classical statistical methods for forecasting – linear regression analysis, logistic regression analysis, clustering, factor analysis, and time series. The output of each of these feeds into the more sophisticated machine learning analytics, but let’s first review them individually.
A quick side-note. Some in the data science industry refer to several of these methods as machine learning too, but in this article machine learning refers to newer, smarter, better methods, such as deep learning.
**Linear regression**

In data science, the linear regression model is used for quantifying the relationships among the variables included in the analysis, for example the relationship between a house’s price and its size, neighborhood, and year of construction. (A regression on its own describes association; it does not prove that one variable causes another.) The model calculates coefficients with which you can predict the price of a new house, provided you have the relevant information available.
*If you’re curious about the geometrical representation of the simple linear regression model, check out the linked tutorial.*
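To make the idea of coefficients concrete, here is a minimal sketch of a simple linear regression fitted by ordinary least squares. It is simplified to a single predictor (house size) rather than the several variables mentioned above, and the sizes, prices, and function names are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Apply the fitted coefficients to a new observation."""
    return intercept + slope * x


# Hypothetical training data: house size in square metres and sale price.
sizes_m2 = [50, 70, 90, 110, 130]
prices = [150_000, 200_000, 250_000, 300_000, 350_000]

intercept, slope = fit_simple_linear_regression(sizes_m2, prices)
estimated_price = predict(intercept, slope, 100)  # price estimate for a 100 m2 house
```

With more predictors, such as neighborhood or year built, the same principle applies, but the coefficients are usually computed with a statistics library rather than by hand.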

**Logistic regression**

Since it’s not possible to express all relationships between variables as linear, data science makes use of methods like logistic regression to create non-linear models. Logistic regression works with binary outcomes, coded as 0s and 1s, and estimates the probability of the outcome being 1. For example, companies apply logistic regression algorithms to filter job candidates during their screening process. If the algorithm estimates that the probability that a prospective candidate will perform well in the company within a year is above 50%, it predicts 1, or a successful application. Otherwise, it predicts 0.
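The 50% rule described above can be sketched as follows. The weights, bias, and candidate features are made up for illustration; in a real model they would be estimated from historical hiring data, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Estimated probability that the outcome is 1 for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(probability, threshold=0.5):
    """Predict 1 when the probability is above the threshold, otherwise 0."""
    return 1 if probability > threshold else 0


# Hypothetical coefficients for two features:
# years of experience and a standardised skills-test score.
weights = [0.8, 0.5]
bias = -2.0

strong_candidate = [3.0, 1.0]
weak_candidate = [1.0, -0.5]

p_strong = predict_probability(strong_candidate, weights, bias)
p_weak = predict_probability(weak_candidate, weights, bias)

decision_strong = classify(p_strong)
decision_weak = classify(p_weak)
```

The threshold is a design choice: 0.5 matches the example in the text, but a company could raise or lower it depending on how costly each kind of mistake is.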
**Cluster analysis**

This exploratory data science technique is applied when the observations in the data form groups according to some criteria. Cluster analysis takes into account that some observations exhibit similarities, and facilitates the discovery of new significant predictors, ones that were not part of the original conceptualisation of the data.
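The text does not name a specific clustering algorithm, so the sketch below swaps in k-means, one widely used option, in a deliberately tiny one-dimensional form. The spending figures, the starting centroids, and the function name `k_means_1d` are assumptions made for the example.

```python
def k_means_1d(values, initial_centroids, iterations=10):
    """Group one-dimensional values around the nearest centroid (basic k-means)."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(value - centroids[i])
            )
            clusters[nearest].append(value)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer.
monthly_spend = [12, 15, 14, 90, 95, 100]
centroids, clusters = k_means_1d(monthly_spend, initial_centroids=[0, 50])
```

The resulting cluster membership can then serve as a new feature, which is the sense in which clustering can surface predictors that were not part of the original data.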
**Factor analysis**

If clustering is about grouping *observations* together, factor analysis is about grouping *features* together. Data science resorts to factor analysis to reduce the dimensionality of a problem. For example, if in a 100-item questionnaire every 10 questions pertain to a single general attitude, factor analysis will identify these 10 factors, which can then be used in a regression that delivers a more interpretable prediction. A lot of the techniques in data science are integrated like this.

**Time series analysis**

Time series analysis is a popular method for following the development of specific values over time. Experts in economics and finance use it because their subject matter includes stock prices and sales volume, variables that are typically plotted against time.

**Where does data science find application for traditional forecasting methods?**

The application of the corresponding techniques is extremely broad; data science is finding a way into an increasingly large number of industries. That said, two prominent fields deserve to be part of the discussion.
**User experience (UX) and data science**

When companies launch a new product, they often design surveys that measure customers’ attitudes towards that product. Analysing the results after the BI team has generated their dashboards includes grouping the observations into segments (e.g. regions) and then analysing each segment separately to extract meaningful predictive coefficients. The results often show that the product needs slight, but meaningfully different, adjustments in each segment in order to maximise customer satisfaction.

**Forecasting sales volume**

This is the type of analysis where time series comes into play. Sales data has been gathered until a certain date, and the data scientist wants to know what is likely to happen in the next sales period, or a year ahead. They apply mathematical and statistical models and run multiple simulations; these simulations provide the analyst with future scenarios. This is at the core of data science, because based on these scenarios, the company can make better predictions and implement adequate strategies.
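One simple way to picture "running multiple simulations" is to resample past period-over-period growth rates and project them forward many times. The sketch below does exactly that; it is an assumption-laden illustration rather than a full time series model (it ignores seasonality and trend changes), and the sales figures and function names are hypothetical.

```python
import random


def growth_rates(sales):
    """Period-over-period growth rates observed in the historical data."""
    return [later / earlier - 1 for earlier, later in zip(sales, sales[1:])]


def simulate_scenarios(sales, periods_ahead, n_scenarios, seed=0):
    """Project sales forward by repeatedly drawing from past growth rates."""
    rng = random.Random(seed)
    rates = growth_rates(sales)
    scenarios = []
    for _ in range(n_scenarios):
        level = sales[-1]
        path = []
        for _ in range(periods_ahead):
            level *= 1 + rng.choice(rates)
            path.append(level)
        scenarios.append(path)
    return scenarios


# Hypothetical monthly sales volume up to the cut-off date.
monthly_sales = [100, 104, 103, 108, 112, 115]
scenarios = simulate_scenarios(monthly_sales, periods_ahead=3, n_scenarios=1000)

# Summarise the simulated outcomes three months ahead.
final_levels = sorted(path[-1] for path in scenarios)
pessimistic = final_levels[(len(final_levels) * 5) // 100]   # about the 5th percentile
optimistic = final_levels[(len(final_levels) * 95) // 100]   # about the 95th percentile
```

The spread between the pessimistic and optimistic values gives the company a range of scenarios to plan against instead of a single number.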
**Who uses traditional forecasting methods?**

The data scientist. But bear in mind that this title also applies to the person who employs machine learning techniques for analytics. A lot of the work spills from one methodology to the other.

The data analyst, on the other hand, is the person who prepares advanced types of analyses that explain the patterns that have already emerged in the data, and who oversees the basic part of the predictive analytics. Of course, if you’re eager to learn more about what a data scientist does and how their job compares to other career paths in the data science field, read our **ultimate guide on how to start a career in data science**.

**Machine Learning and Data Science**

Machine learning is the state-of-the-art approach to data science. And rightly so.
The main advantage machine learning has over any of the traditional data science techniques is the fact that at its core resides **the algorithm**: the set of directions a computer uses to find a model that fits the data as well as possible. The difference between machine learning and traditional data science methods is that we do not give the computer instructions on how to find the model; it takes the algorithm and uses its directions to learn on its own how to find said model. Unlike traditional data science, machine learning needs little human involvement. In fact, machine learning algorithms, especially deep learning ones, are so complicated that humans cannot genuinely understand what is happening “inside”.
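The article does not name a particular algorithm, so as one common example, the sketch below uses gradient descent to let the computer find a straight-line model by repeatedly nudging its parameters to reduce the error. The data, learning rate, number of steps, and function name are assumptions chosen for illustration; real machine learning models have far more parameters.

```python
def fit_by_gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Find w and b for the model y = w * x + b by minimising mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # How far off the current model is for every observation.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Move both parameters a small step in the direction that lowers the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_by_gradient_descent(xs, ys)
```

Nobody tells the program that the slope is 2 and the intercept is 1; it only receives the data and the update rule, and it converges towards those values on its own.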
