Data science deals with large amounts of data, and data scientists analyse that data to extract useful information from it. This data analysis process involves 5 steps (1). In this post we will discuss those 5 steps and explore some of the challenges we face during each of them.
In the previous post we concluded that data science is a mixture of computing methods and statistical methods. In both data science and statistics, the core objective is to analyse data. In data science, however, we automate some of the steps involved in the data analysis process, which is the major difference between the two fields.
The 5 steps involved in the data analysis process:
These 5 steps are described in the paper ‘Enterprise Data Analysis and Visualization: An Interview Study’ by Sean Kandel et al. It is an interview study in which the authors interviewed 35 data analysts from 25 organizations. They describe it as follows:
“To better understand the enterprise analysts’ ecosystem, we conducted semi-structured interviews with 35 data analysts from 25 organizations across a variety of sectors, including healthcare, retail, marketing and finance. Based on our interview data, we characterize the process of industrial data analysis and document how organizational features of an enterprise impact it”
The 5 data analysis steps mentioned in the paper are as follows:
The first step, discovery, is to find and collect the data for analysis. Data can be gathered from multiple sources such as database tables, log files, spreadsheets or online sources. The challenges in this phase include finding relevant data and interpreting what certain fields in the database tables mean.
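As a rough illustration of the discovery step, the sketch below gathers records from two kinds of sources, a database table and a spreadsheet exported as CSV, and lists their field names so that they can be inspected. The table name `orders`, the column names and the sample values are all made up for this example, and an in-memory SQLite database stands in for a real one.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a database table (in-memory SQLite for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, cust_id TEXT, amt REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "C1", 20.5), (2, "C2", 13.0)],
)

# Hypothetical source 2: a spreadsheet exported as CSV text.
csv_text = "cust_id,region\nC1,North\nC2,South\n"


def load_table(connection, table):
    """Return (field names, rows) for a database table."""
    cursor = connection.execute(f"SELECT * FROM {table}")
    fields = [col[0] for col in cursor.description]
    return fields, cursor.fetchall()


def load_csv(text):
    """Return (field names, rows as dicts) for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return reader.fieldnames, rows


order_fields, order_rows = load_table(conn, "orders")
customer_fields, customer_rows = load_csv(csv_text)

# Listing the fields is a first step towards interpreting them.
print("orders:", order_fields)
print("customers:", customer_fields)
```

A terse field name such as `amt` is exactly the kind of thing that usually has to be clarified with whoever owns the data before the analysis can proceed.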
Once the data is collected, the next step is wrangling, or cleaning, the available data. Manipulating the data and integrating data obtained from multiple sources are the main tasks performed in this phase. Some of the issues data scientists face here are processing semi-structured data (e.g. data received from log files) and integrating data obtained from diverse sources.
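To make the two wrangling issues a little more concrete, here is a minimal sketch that parses semi-structured log lines into records and then integrates them with a second source. The log format, the field names and the customer table are assumptions invented for this example; real logs will need their own parsing rules.

```python
import re

# Hypothetical semi-structured log lines; the last one is malformed.
log_lines = [
    "2021-03-01 12:00:01 user=C1 action=login",
    "2021-03-01 12:05:10 user=C2 action=purchase amount=13.0",
    "corrupted line without fields",
]

# Hypothetical second source: customer records keyed by customer id.
customers = {"C1": {"region": "North"}, "C2": {"region": "South"}}

LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$")


def parse_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    date, time, rest = match.groups()
    record = {"date": date, "time": time}
    # The remainder is a variable set of key=value pairs.
    for pair in rest.split():
        key, sep, value = pair.partition("=")
        if sep:
            record[key] = value
    return record


parsed = [r for r in (parse_line(line) for line in log_lines) if r is not None]

# Integrate the two sources by joining on the customer id.
merged = [{**r, **customers.get(r.get("user"), {})} for r in parsed]
for row in merged:
    print(row)
```

The malformed line is dropped silently here to keep the sketch short; in a real analysis we would probably want to count or log such lines rather than discard them unnoticed.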
The third step is profiling. Before using the available data in any analysis, we need to make sure that there are no issues in it. Data may have quality issues such as missing, erroneous or extreme values, which may affect the analysis results. In this phase the data analyst makes sure that there are no anomalies in the data that is going to be used in the analysis.
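A simple profiling check could look like the sketch below, which counts missing values in a column and flags extreme values. The sample numbers are made up, and the 1.5 × IQR rule used to decide what counts as "extreme" is just one common convention chosen for this example, not something prescribed by the paper.

```python
import statistics

# Hypothetical column of measurements; None marks a missing value.
values = [12.0, 11.5, None, 12.3, 11.9, 95.0, 12.1, None, 11.8]


def profile(column):
    """Count missing entries and flag extreme values with the 1.5 * IQR rule."""
    present = [v for v in column if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "missing": len(column) - len(present),
        "extreme": [v for v in present if v < low or v > high],
    }


report = profile(values)
print(report)
```

For this sample the check reports two missing entries and flags 95.0 as extreme. Whether such a value is an error or a genuine observation is still a judgment the analyst has to make.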
In the modeling phase, the data analyst decides the features, scale and statistical method to be used for the analysis. Some of the issues faced during this phase are selecting relevant features and working with analysis tools that do not scale to the size of the data.
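As one possible way to approach feature selection, the sketch below ranks candidate features by the absolute value of their Pearson correlation with the target. The feature names and numbers are invented for illustration, and correlation ranking is only a simple filter-style heuristic (it captures linear relationships only), not the method used in the paper.

```python
import math

# Hypothetical data set: two candidate features and a target value.
features = {
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "day_of_month": [3.0, 1.0, 4.0, 1.0, 5.0],
}
target = [2.1, 3.9, 6.2, 8.1, 9.8]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Rank features by the strength of their linear relation to the target.
ranked = sorted(
    features,
    key=lambda name: abs(pearson(features[name], target)),
    reverse=True,
)
print(ranked)
```

Here `ad_spend` ends up ranked first because it tracks the target almost linearly, while `day_of_month` is only weakly related to it.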
In the final step, reporting, the insights gained from the analysis are reported. Points to consider in this phase are communicating the assumptions behind the analysis effectively and the limitations of static reports (i.e. reports with no interactive way to check the results).
The data analysis process involves 5 phases, namely discovery, wrangling, profiling, modeling and reporting. Some analyses may skip some of these steps, depending on the nature of the analysis. In this post we also discussed some of the issues data analysts face during each phase.
(1) Sean Kandel et al., ‘Enterprise Data Analysis and Visualization: An Interview Study’