Data Analyst Interview Questions
Data Analyst Interview Questions for Freshers
Welcome! At Yes-M Systems, we understand that starting your career as a Data Analyst can be exciting. To help you prepare, we’ve compiled a set of interview questions tailored for freshers. These questions will assess your foundational knowledge, problem-solving abilities, and understanding of core data analysis principles.
What do you mean by Data Analysis?
Data analysis is a multidisciplinary field within data science in which data is examined using mathematical, statistical, and computer science techniques, combined with domain expertise, to discover useful information or patterns. It involves gathering, cleaning, transforming, and organizing data in order to draw conclusions, forecast, and make informed decisions. The purpose of data analysis is to turn raw data into actionable knowledge that can be used to guide decisions, solve problems, or reveal hidden trends.
What is Data Analytics?
Data analytics is the practice of collecting data from different sources, cleaning it using various tools, technologies, and algorithms, and then analysing it to generate meaningful insights that support business problem solving, improve customer experience and engagement, or drive business growth.
What is the difference between Analysis and Analytics?
Analysis and analytics have broadly similar meanings but are used in different contexts.
Analysis – Analysis is the process of collecting information/data, examining it carefully, identifying patterns, trends, and characteristics in the collected data, and drawing meaningful findings so that corrective measures can be taken or risks mitigated. It is usually based on historical data and is used to assess the current situation or problem area. The data is broken down into small components and examined carefully to drive business decisions.
Some examples are root-cause analysis (RCA) using a fishbone diagram, customer sentiment analysis using NLP and ML models, and employee attrition analysis using statistical models to improve retention.
Analytics – Analytics is used in a broader sense: data is collected systematically from various sources, pre-processed using statistical models and mechanisms, and turned into insights for business decisions. It relies not only on historical data but also on current data to train models that predict and forecast future trends and uncover new opportunities for business growth.
There are four types of analytics:
- Descriptive Analytics – Describes the current situation, trends, and position of the organization compared with the previous year's or month's results.
- Diagnostic Analytics – Dives deep into the collected data to find the reasons behind the observed trends and why a particular event happened. This helps in assessing opportunities to improve.
- Predictive Analytics – Uses machine learning models on historical and current data to predict future trends, events, and outcomes and to forecast accurately for better business growth.
- Prescriptive Analytics – Recommends the corrective actions to take in order to make progress or to avoid a particular event in the future.
Some examples are Sales data analytics for future trends & forecasts, disease detection & prevention, resource optimization etc.
What does a Data Analyst do?
Data Analyst:
- Collects & complies the data from various sources
- Pre-processing the data to remove null values, duplicates, format issues, errors and outliers to make data clean and good quality
- Does descriptive, diagnostic & prescriptive analysis of the data using statistical & ML models
- Develop reports/dashboards using visualization tools like PowerBI, Tableau or QlikView to generate insights
- Does predictive analytics based on the need or problem statement
- Communicate findings/results to stakeholders & leadership to make business decisions
What are the different tools mainly used for data analysis?
There are many different tools used for data analysis, and each has its own strengths and weaknesses. Some of the most commonly used tools are as follows:
- Spreadsheet Software: Spreadsheet software is used for a variety of data analysis tasks, such as sorting, filtering, and summarizing data. It also has several built-in functions for performing statistical analysis. Two of the most widely used spreadsheet programs are:
  - Microsoft Excel
  - Google Sheets
- Database Management Systems (DBMS): Database management systems are crucial resources for data analysis. They offer a secure and efficient way to manage, store, and organize massive amounts of data. Widely used DBMSs include:
  - MySQL
  - PostgreSQL
  - Microsoft SQL Server
  - Oracle Database
- Statistical Software: There are many statistical software packages used for data analysis, each with its own strengths and weaknesses. Some of the most popular are:
  - SAS: Widely used in various industries for statistical analysis and data management.
  - SPSS: A software suite used for statistical analysis in social science research.
  - Stata: A tool commonly used for managing, analyzing, and graphing data in various fields.
- Programming Languages: In data analysis, programming languages are used for deep, customized analysis grounded in mathematical and statistical concepts. Two languages are especially popular for data analysis:
  - R: R is a free and open-source programming language widely used for data analysis. It offers strong visualization capabilities and an environment designed mainly for statistical analysis, along with a wide variety of packages for different data analysis tasks.
  - Python: Python is also a free and open-source programming language used for data analysis, and it is increasingly popular among researchers. Along with data analysis, it is used for machine learning, artificial intelligence, and web development.
How do data analysts differ from data scientists?
Data analysts and data scientists are distinguished by their responsibilities, skill sets, and areas of expertise, although in practice the two roles sometimes overlap or are not clearly separated.
Data analysts are responsible for collecting, cleaning, and analyzing data to help businesses make better decisions. They typically use statistical analysis and visualization tools to identify trends and patterns in data. Data analysts may also develop reports and dashboards to communicate their findings to stakeholders.
Data scientists are responsible for creating and implementing machine learning and statistical models on data. These models are used to make predictions, automate jobs, and enhance business processes. Data scientists are also well-versed in programming languages and software engineering.
| Feature | Data Analyst | Data Scientist |
| --- | --- | --- |
| Skills | Excel, SQL, Python, R, Tableau, Power BI | Machine learning, statistical modeling, Docker, software engineering |
| Tasks | Data collection, web scraping, data cleaning, data visualization, exploratory data analysis, report development and presentation | Database management, predictive and prescriptive analysis, machine learning model building and deployment, task automation, business process improvement |
How is Data Analysis similar to Business Intelligence?
Data analysis and business intelligence are closely related fields: both use data and analysis to make better, more effective decisions. However, there are some key differences between the two.
- Data analysis involves gathering, inspecting, cleaning, and transforming data and extracting relevant information so that it can be used in the decision-making process.
- Business Intelligence (BI) also analyses data to find insights aligned with business requirements. It generally uses statistical and data-visualization tools, popularly known as BI tools, to present the data in user-friendly views such as reports, dashboards, charts, and graphs.
The similarities and differences between data analysis and business intelligence are as follows:
| Similarities | Differences |
| --- | --- |
| Both use data to make better decisions. | Data analysis is more technical, while BI is more strategic. |
| Both involve collecting, cleaning, and transforming data. | Data analysis focuses on finding patterns and insights in data, while BI focuses on presenting relevant information for decision-making. |
| Both use visualization tools to communicate findings. | Data analysis is often used to answer specific questions, whereas BI is used to support broader decision-making. |
What is Data Wrangling?
Data wrangling, also known as data munging, is closely related to data preprocessing. It is the process of cleaning, transforming, and organizing raw, messy, or unstructured data into a usable format. The main goal of data wrangling is to improve the quality and structure of the dataset so that it can be used for analysis, model building, and other data-driven tasks.
Data wrangling can be a complicated and time-consuming process, but it is critical for businesses that want to make data-driven decisions. By taking the time to wrangle their data, businesses can obtain significant insights about their products, services, and bottom line.
Some of the most common tasks involved in data wrangling are as follows:
- Data Cleaning: Identify and remove errors, inconsistencies, and missing values from the dataset.
- Data Transformation: Transform the structure, format, or values of the data as the analysis requires; this may include scaling and normalization or encoding categorical values.
- Data Integration: Combine two or more datasets when data is scattered across multiple sources and a consolidated analysis is needed.
- Data Restructuring: Reorganize the data to make it more suitable for analysis; data may be reshaped into different formats, or new variables created by aggregating features at different levels.
- Data Enrichment: Enrich the data by adding relevant additional information, which may be external data or an aggregation of two or more existing features.
- Quality Assurance: Ensure that the data meets certain quality standards and is fit for analysis.
What is Data Profiling?
- Data profiling in data analytics is a proactive approach to examining the transformed data, analysing it from various angles, and creating useful summaries and trend views of the data.
- This process uncovers the data's metadata to determine its legitimacy, functional dependencies, relationships, and overall quality, helping to catch the bad data that often costs organizations dearly. The profiled information can be used to fix small issues in the data before they cause bigger problems later.
What is the difference between descriptive and predictive analysis?
Descriptive and predictive analysis are two different ways to analyze data.
- Descriptive Analysis: Descriptive analysis is used to describe questions like “What has happened in the past?” and “What are the key characteristics of the data?”. Its main goal is to identify the patterns, trends, and relationships within the data. It uses statistical measures, visualizations, and exploratory data analysis techniques to gain insight into the dataset.
The key characteristics of descriptive analysis are as follows:
- Historical Perspective: Descriptive analysis is concerned with understanding past data and events.
- Summary Statistics: It often involves calculating basic statistical measures like mean, median, mode, standard deviation, and percentiles.
- Visualizations: Graphs, charts, histograms, and other visual representations are used to illustrate data patterns.
- Patterns and Trends: Descriptive analysis helps identify recurring patterns and trends within the data.
- Exploration: It is used for initial data exploration and hypothesis generation.
- Predictive Analysis: Predictive Analysis, on the other hand, uses past data and applies statistical and machine learning models to identify patterns and relationships and make predictions about future events. Its primary goal is to predict or forecast what is likely to happen in future.
The key characteristics of predictive analysis are as follows:
- Future Projection: Predictive analysis is used to forecast and predict future events.
- Model Building: It involves developing and training models on historical data to predict outcomes.
- Validation and Testing: Predictive models are validated and tested on unseen data to assess their accuracy.
- Feature Selection: Identifying the relevant features (variables) that influence the predicted outcome is crucial.
- Decision Making: Predictive analysis supports decision-making by providing insights into potential outcomes.
What are the steps you would take to analyze a dataset?
Data analysis involves a series of steps that transform raw data into relevant insights, conclusions, and actionable suggestions. While the specific approach will vary based on the context and aims of the study, here is an approximate outline of the processes commonly followed in data analysis:
- Problem Definition or Objective: Make sure that the problem or question you’re attempting to answer is stated clearly. Understand the analysis’s aims and objectives to direct your strategy.
- Data Collection: Collate relevant data from various sources. This might include surveys, tests, databases, web scraping, and other techniques. Make sure the data is representative and accurate.
- Data Preprocessing or Data Cleaning: Raw data often has errors, missing values, and inconsistencies. In data preprocessing and cleaning, we rename columns or recode values where needed, standardize formats, and deal with missing values.
- Exploratory Data Analysis (EDA): EDA is a crucial step in Data analysis. In EDA, we apply various graphical and statistical approaches to systematically analyze and summarize the main characteristics, patterns, and relationships within a dataset. The primary objective behind the EDA is to get a better knowledge of the data’s structure, identify probable abnormalities or outliers, and offer initial insights that can guide further analysis.
- Data Visualizations: Data visualizations play a very important role in data analysis. It provides visual representation of complicated information and patterns in the data which enhances the understanding of data and helps in identifying the trends or patterns within a data. It enables effective communication of insights to various stakeholders.
What is data cleaning?
Data cleaning is the process of identifying and removing misleading or inaccurate records from a dataset. The primary objective of data cleaning is to improve the quality of the data so that it can be used for analysis and predictive model-building tasks. It is the step that follows data collection and loading.
In data cleaning, we fix a range of issues such as the following (a short SQL sketch appears after the list):
- Inconsistencies: Data is sometimes stored inconsistently because of variations in formats, column names, data types, or value naming conventions, which makes aggregation and comparison difficult. Before any further analysis, we correct these inconsistencies and formatting issues.
- Duplicate entries: Duplicate records can bias analysis results, producing inflated counts or incorrect statistical summaries, so we remove them.
- Missing values: Some data points may be missing. Before going further, we either drop the affected rows or columns or fill the missing values with plausible estimates.
- Outliers: Outliers are data points that differ drastically from the rest of the data and may result from measurement or data-entry errors. If not handled properly they can bias results, although they can also offer useful insights, so we first detect outliers and then decide how to treat them.
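A minimal SQL sketch of a few of these fixes, assuming a hypothetical sales_raw table (the table and column names are illustrative, not from any particular dataset):
```sql
-- Standardize inconsistent text formatting
UPDATE sales_raw
SET region = UPPER(TRIM(region));

-- Find duplicate entries on the business key
SELECT order_id, COUNT(*) AS copies
FROM sales_raw
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Fill missing discount values with a sensible default
UPDATE sales_raw
SET discount = 0
WHERE discount IS NULL;

-- Inspect suspicious outliers before deciding how to treat them
SELECT *
FROM sales_raw
WHERE amount < 0 OR amount > 100000;
```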
What is the difference between quantitative and qualitative data analysis?
- Quantitative data analysis is performed on numerical data, using mathematical calculations and statistical methods to find patterns, trends, and relationships between different features.
- Some examples are financial data, ratings, clinical research, and demographic data analysis.
- Qualitative data analysis is the examination and interpretation of non-numerical data to find patterns, themes, and meaning in the data.
- Some examples are case studies, surveys, interviews, and feedback.
What's the difference between structured and unstructured data?
Whether data is structured or unstructured depends on the format in which it is stored. Structured data is information that has been organized in a defined format, such as a table or spreadsheet, which makes it easy to search, sort, and analyze. Unstructured data is information that is not arranged in a predefined format, which makes searching, sorting, and analyzing more complex.
The differences between the structured and unstructured data are as follows:
| Feature | Structured Data | Unstructured Data |
| --- | --- | --- |
| Structure | Schema is rigid; data is organized into rows and columns. | No predefined schema or relationships between data elements. |
| Searchability | Excellent for searching, reporting, and querying. | Difficult to search. |
| Analysis | Simple to quantify and process using standard database functions. | No fixed format, making it more challenging to organize and analyze. |
| Storage | Relational databases | Data lakes |
| Examples | Customer records, product inventories, financial data | Text documents, images, audio, video |
SQL Interview Questions for Data Analysts
What is DBMS?
DBMS stands for Database Management System. It is software designed to manage, store, retrieve, and organize data in a structured manner. It provides an interface or a tool for performing CRUD operations on a database. It serves as an intermediary between the user and the database, allowing users or applications to interact with the database without needing to understand the underlying complexities of data storage and retrieval.
What are the basic SQL CRUD operations?
CRUD stands for CREATE (INSERT), READ (SELECT), UPDATE, and DELETE, the basic data-manipulation (DML) operations in SQL. The CREATE/INSERT operation adds new records to a database table, READ retrieves data from one or more tables, UPDATE modifies existing records in a table, and DELETE removes records from a table based on specified conditions.
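A minimal sketch of the four operations, assuming a hypothetical employees table:
```sql
-- CREATE: add a new record
INSERT INTO employees (emp_id, name, department, salary)
VALUES (101, 'Asha Rao', 'Finance', 55000);

-- READ: retrieve records
SELECT emp_id, name, salary
FROM employees
WHERE department = 'Finance';

-- UPDATE: modify an existing record
UPDATE employees
SET salary = 60000
WHERE emp_id = 101;

-- DELETE: remove a record
DELETE FROM employees
WHERE emp_id = 101;
```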
What is the SQL statement used to insert new records into a table?
We use the ‘INSERT’ statement to insert new records into a table. The ‘INSERT INTO’ statement in SQL is used to add new records (rows) to a table.
How do you filter records using the WHERE clause in SQL?
We can filter records by including a WHERE clause in the SELECT statement and specifying the conditions that records must meet in order to be included in the result.
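For example, on the hypothetical employees table used above:
```sql
-- Return only employees who satisfy both conditions
SELECT emp_id, name, salary
FROM employees
WHERE department = 'Sales'
  AND salary > 50000;
```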
How can you sort records in ascending or descending order using SQL?
We can sort records in ascending or descending order by using the ORDER BY clause with the SELECT statement. The ORDER BY clause lets us specify one or more columns to sort the result set by, along with the desired sorting order, i.e., ascending (ASC) or descending (DESC).
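A short example on the same hypothetical employees table:
```sql
-- Sort by department A to Z, then by salary from highest to lowest within each department
SELECT name, department, salary
FROM employees
ORDER BY department ASC, salary DESC;
```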
Explain the purpose of the GROUP BY clause in SQL.
The purpose of the GROUP BY clause in SQL is to group rows that have the same values in specified columns. Rows that share the same value in a column are arranged into one group, typically so that aggregate functions can be applied to each group.
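For instance, counting employees per department in the hypothetical employees table:
```sql
-- One output row per department, with the number of employees in each
SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;
```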
How do you perform aggregate functions like SUM, COUNT, AVG, and MAX/MIN in SQL?
An aggregate function groups together the values of multiple rows as input to form a single value of more significant meaning. It is also used to perform calculations on a set of values and then returns a single result. Some examples of aggregate functions are SUM, COUNT, AVG, and MIN/MAX.
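A sketch combining the common aggregate functions on the hypothetical employees table:
```sql
SELECT department,
       COUNT(*)    AS employee_count,
       SUM(salary) AS total_salary,
       AVG(salary) AS average_salary,
       MIN(salary) AS lowest_salary,
       MAX(salary) AS highest_salary
FROM employees
GROUP BY department;
```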
Explain the different types of joins in SQL.
A JOIN is used to bring together data from two or more tables by utilizing a common column that is present in each table. The main variations are INNER JOIN (returns only rows with matching values in both tables), LEFT JOIN (returns all rows from the left table plus matching rows from the right), RIGHT JOIN (the reverse of LEFT JOIN), and FULL OUTER JOIN (returns rows from both tables, matched where possible). These variations differ in how data from the involved tables is paired and retrieved.
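A brief sketch, assuming hypothetical customers and orders tables linked by customer_id:
```sql
-- INNER JOIN: only customers that have at least one order
SELECT c.customer_id, c.name, o.order_id, o.amount
FROM customers AS c
INNER JOIN orders AS o
        ON o.customer_id = c.customer_id;

-- LEFT JOIN: all customers; order columns are NULL where no order exists
SELECT c.customer_id, c.name, o.order_id, o.amount
FROM customers AS c
LEFT JOIN orders AS o
       ON o.customer_id = c.customer_id;
```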
How can you write an SQL query to retrieve data from multiple related tables?
To retrieve data from multiple related tables, we generally use a SELECT statement together with a JOIN, which lets us fetch records from several tables in a single query. JOINs are used when the tables share related columns, typically a primary key and foreign key pair. The different types of joins, i.e. INNER, LEFT, RIGHT, and FULL JOIN, are explained in detail in the previous question.
What is a subquery in SQL? How can you use it to retrieve specific data?
A subquery is a query nested inside another query. It is most often embedded in the WHERE clause of an SQL statement, but it can also appear in the HAVING and FROM clauses. Subqueries are used with SELECT, INSERT, UPDATE, and DELETE statements together with comparison or equality operators such as >=, =, and <=, as well as operators like IN and LIKE.
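Two short examples on the hypothetical employees table used earlier:
```sql
-- Subquery in the WHERE clause: employees earning more than the overall average
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

-- Subquery in the FROM clause: average salary per department, then filter that result
SELECT dept_avg.department, dept_avg.avg_salary
FROM (SELECT department, AVG(salary) AS avg_salary
      FROM employees
      GROUP BY department) AS dept_avg
WHERE dept_avg.avg_salary > 50000;
```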
What is the purpose of the HAVING clause in SQL? How is it different from the WHERE clause?
In SQL, the HAVING clause is used to filter the results of a GROUP BY query depending on aggregate functions applied to grouped columns. It allows you to filter groups of rows that meet specific conditions after grouping has been performed. The HAVING clause is typically used with aggregate functions like SUM, COUNT, AVG, MAX, or MIN.
The main differences between HAVING and WHERE clauses are as follows:
| HAVING | WHERE |
| --- | --- |
| The HAVING clause is used to filter groups of rows after grouping. It operates on the results of aggregate functions applied to grouped columns. | The WHERE clause is used to filter rows before grouping. It operates on individual rows in the table and is applied before grouping and aggregation. |
| The HAVING clause is typically used with GROUP BY queries. It filters groups of rows based on conditions involving aggregated values. | The WHERE clause can be used with any SQL query, whether it involves grouping or not. It filters individual rows based on specified conditions. |
| In the HAVING clause, you generally use aggregate functions (e.g., SUM, COUNT) to reference grouped columns and apply conditions to groups of rows. | In the WHERE clause, you can reference columns directly and apply conditions to individual rows. |
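A short query illustrating both clauses together, assuming the hypothetical employees table also has a status column:
```sql
-- WHERE filters individual rows before grouping;
-- HAVING filters the groups produced by GROUP BY.
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE status = 'active'          -- row-level filter
GROUP BY department
HAVING AVG(salary) > 50000;      -- group-level filter on the aggregate
```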
Explain window functions in SQL. How do they differ from regular aggregate functions?
In SQL, window functions perform calculations across a set of rows related to the current row (the "window") while still returning one result per row. They provide a way to perform complex calculations and analysis without the need for self-joins or subqueries.
Window vs Regular Aggregate Function
| Window Functions | Aggregate Functions |
| --- | --- |
| Window functions perform calculations within a specific “window” or subset of rows defined by an OVER() clause. The window can be customized based on specific criteria, such as rows sharing the same value in a column or rows ordered in a specific way. | Regular aggregate functions operate on the entire result set and return a single value for the entire set of rows. |
| Window functions return a result for each row in the result set based on its specific window, so each row can have a different result. | Aggregate functions return a single result for the entire dataset; each row receives the same value. |
| Window functions provide an aggregate result while retaining the details of individual rows within the defined window. | Regular aggregates provide a summary of the entire dataset, often losing detail about individual rows. |
| Window functions require the OVER() clause to specify the window’s characteristics, such as the partitioning and ordering of rows. | Regular aggregate functions do not use the OVER() clause because they have no notion of a window. |
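A side-by-side sketch on the hypothetical employees table (window-function syntax requires a reasonably modern database, e.g. MySQL 8+, PostgreSQL, or SQL Server):
```sql
-- Aggregate version: one row per department
SELECT department, AVG(salary) AS dept_avg
FROM employees
GROUP BY department;

-- Window version: every employee row is kept, and each row also carries
-- its department's average salary and the employee's rank within the department
SELECT name,
       department,
       salary,
       AVG(salary) OVER (PARTITION BY department) AS dept_avg,
       RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
FROM employees;
```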
How can you optimize the performance of a slow SQL query?
Some common techniques include (see the sketch after this list):
- Indexing columns used in WHERE, JOIN, and ORDER BY clauses
- Optimizing database configuration and hardware resources
- Simplifying complex queries and reducing joins
- Using efficient data types and minimizing data size
- Avoiding SELECT *
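A small illustration of a couple of these techniques, using the hypothetical orders and customers tables (EXPLAIN output and syntax vary slightly between databases):
```sql
-- Index a column that is frequently filtered and joined on
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- Select only the columns you need instead of SELECT *
SELECT o.order_id, o.amount, c.name
FROM orders AS o
INNER JOIN customers AS c ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';

-- Inspect the execution plan to verify that the index is actually used
EXPLAIN
SELECT order_id
FROM orders
WHERE customer_id = 42;
```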
Explain the concept of database normalization and its importance.
Database Normalization is the process of reducing data redundancy in a table and improving data integrity. It is a way of organizing data in a database. It involves organizing the columns and tables in the database to ensure that their dependencies are correctly implemented using database constraints.
It is important because of the following reasons:
- It eliminates redundant data.
- It reduces the chances of data errors.
- It allows the database to take up less disk space.
- It helps improve performance.
- It improves the data integrity and consistency.
Can you list and briefly describe the normal forms (1NF, 2NF, 3NF) in SQL?
Normalization can take numerous forms, the most common of which are 1NF (First Normal Form), 2NF (Second Normal Form), and 3NF (Third Normal Form). Here is a quick rundown of each, followed by a small schema sketch:
- First Normal Form (1NF): In 1NF, each table cell should contain only a single value, and each column should have a unique name. 1NF helps in eliminating duplicate data and simplifies the queries. It is the fundamental requirement for a well-structured relational database. 1NF eliminates all the repeating groups of the data and also ensures that the data is organized at its most basic granularity.
- Second Normal Form (2NF): 2NF builds on 1NF by eliminating partial dependencies, requiring that every non-key attribute depends on the entire primary key rather than on only part of it. In other words, each non-key column should be directly related to the whole primary key and not to other columns, which further reduces data redundancy and anomalies.
- Third Normal Form (3NF): Third Normal Form (3NF) builds on the Second Normal Form (2NF) by requiring that all non-key attributes are independent of each other. This means that each column should be directly related to the primary key, and not to any other columns in the same table.
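A minimal sketch of what a roughly 3NF design looks like, assuming a hypothetical ordering scenario (table and column names are illustrative):
```sql
-- Unnormalized: customer and product details repeated on every order row
-- orders_flat(order_id, customer_name, customer_city, product_name, product_price, order_date)

-- Normalized (roughly 3NF): each fact is stored once and linked by keys
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    city        VARCHAR(100)
);

CREATE TABLE products (
    product_id INT PRIMARY KEY,
    name       VARCHAR(100),
    price      DECIMAL(10, 2)
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT,
    product_id  INT,
    order_date  DATE,
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id),
    FOREIGN KEY (product_id)  REFERENCES products (product_id)
);
```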
What is the difference between normalization and denormalization in database design?
- Normalization is used in a database to reduce data redundancy and inconsistency in tables. Denormalization deliberately adds redundancy so that queries execute as quickly as possible.
| S.No | Normalization | Denormalization |
| --- | --- | --- |
| 1. | Non-redundant, consistent data is stored in a set schema. | Data is combined so that queries execute as quickly as possible. |
| 2. | Data inconsistency and redundancy are reduced. | Redundancy is added deliberately for better query execution. |
| 3. | Data integrity is established and maintained. | Data integrity is not guaranteed. |
| 4. | Data redundancy is eliminated or reduced. | Redundancy is added instead of being eliminated or reduced. |
| 5. | The number of tables increases. | The number of tables decreases. |
| 6. | Optimizes the use of disk space. | Does not optimize the use of disk space. |
What are primary keys and foreign keys in SQL? Why are they important?
Primary keys and foreign keys are two fundamental concepts in SQL that are used to build and enforce connections between tables in a relational database management system (RDBMS). A short table-definition sketch illustrating both appears after the lists below.
- Primary key: A primary key uniquely identifies each row in a table; the primary-key column(s) must always contain unique values and cannot contain NULLs. The primary key is either an existing table column or a value generated by the database itself, for example from a sequence or auto-increment column.
Importance of Primary Keys:
- Uniqueness
- Query Optimization
- Data Integrity
- Relationships
- Data Retrieval
- Foreign key: A foreign key is a column or group of columns in a database table that provides a link between data in two tables. The foreign-key column references a column (usually the primary key) of another table.
Importance of Foreign Keys:
- Relationships
- Data Consistency
- Query Efficiency
- Referential Integrity
- Cascade Actions
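A minimal sketch of both key types, assuming hypothetical departments and employees tables:
```sql
-- departments.dept_id is the primary key; employees.dept_id is a foreign key
CREATE TABLE departments (
    dept_id   INT PRIMARY KEY,
    dept_name VARCHAR(100) NOT NULL
);

CREATE TABLE employees (
    emp_id  INT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL,
    dept_id INT,
    CONSTRAINT fk_employees_dept
        FOREIGN KEY (dept_id) REFERENCES departments (dept_id)
);

-- This insert fails unless department 10 already exists in departments,
-- which is how the foreign key enforces referential integrity
INSERT INTO employees (emp_id, name, dept_id)
VALUES (1, 'Asha Rao', 10);
```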
Describe the concept of a database transaction. Why is it important to maintain data integrity?
Database transactions are sets of operations that perform a single logical unit of work and may change data in the database. Transactions are one of the major mechanisms a DBMS provides to protect user data from system failures, by ensuring that the data is restored to a consistent state when the system restarts. A transaction corresponds to one execution of a user program and contains a finite number of steps.
Transactions are important for data integrity because they ensure that the database always remains in a valid and consistent state, even with multiple users or many concurrent operations. They enforce the ACID properties, i.e., atomicity, consistency, isolation, and durability, providing a solid and robust mechanism to keep data accurate, consistent, and reliable in complex and concurrent database environments. Without transactions, it would be challenging to guarantee data integrity in relational database systems.
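A classic sketch of a transaction, assuming a hypothetical accounts table (the exact syntax varies by database; for example, SQL Server uses BEGIN TRANSACTION instead of START TRANSACTION):
```sql
-- Transfer 100 between two accounts as one atomic unit of work
START TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

-- If both updates succeed, make the changes permanent
COMMIT;

-- If anything goes wrong partway through, undo everything instead:
-- ROLLBACK;
```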
Explain how NULL values are handled in SQL queries, and how you can use functions like IS NULL and IS NOT NULL.
In SQL, NULL is a special marker that represents a missing or absent value in a database column. Handling NULLs correctly is crucial for accurate and meaningful data retrieval and manipulation, because ordinary comparisons such as column = NULL never match (any comparison with NULL evaluates to unknown). SQL therefore provides the IS NULL and IS NOT NULL operators to test for NULL values.
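A few short examples, assuming a hypothetical customers table with a phone column:
```sql
-- Customers whose phone number is missing
SELECT customer_id, name
FROM customers
WHERE phone IS NULL;

-- Customers whose phone number is present
SELECT customer_id, name, phone
FROM customers
WHERE phone IS NOT NULL;

-- COALESCE substitutes a default value when the column is NULL
SELECT customer_id, COALESCE(phone, 'not provided') AS phone
FROM customers;
```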
Data Visualization and BI Tools Interview Questions
What is Power BI?
Power BI is a business analytics service by Microsoft that provides interactive visualizations and business intelligence capabilities with an interface simple enough for end-users to create their reports and dashboards.
Differentiate between Power BI Desktop, Power BI Service, and Power BI Mobile.
Power BI Desktop is used for creating reports, Power BI Service (or Power BI Online) is the cloud service for sharing and collaborating on reports, and Power BI Mobile allows users to access reports on mobile devices.
Explain the role of Power Query in Power BI.
Power Query is used for data transformation and shaping. It allows users to connect to various data sources, clean and transform data before loading it into Power BI for analysis.
What is DAX in Power BI, and why is it important?
DAX (Data Analysis Expressions) is a formula language used for creating custom calculations in Power BI. It is important as it enables users to create sophisticated measures and calculated columns.
How do you create relationships between tables in Power BI?
In Power BI Desktop, go to the “Model” view and drag and drop fields from one table to another to create relationships based on common keys.
What is the difference between a calculated column and a measure in Power BI?
A calculated column is a column added to a table, computed row by row, while a measure is a formula applied to a set of data, providing a dynamic calculation based on the context.
How can you implement row-level security in Power BI?
Row-level security in Power BI can be implemented by creating roles in Power BI Desktop and defining filters at the row level based on user roles.
Explain the purpose of the Power BI Gateway.
The Power BI Gateway allows for a secure connection between Power BI services and on-premises data sources. It facilitates refreshing datasets and running scheduled refreshes.
What is a Power BI dashboard?
A Power BI dashboard is a single-page, interactive view of your data that provides a consolidated and visualized summary of key metrics. It can include visuals, images, and live data.
How can you share a Power BI report with others?
Power BI reports can be shared through the Power BI service. Publish the report to the Power BI service, and then share it with specific users or distribute it widely within an organization.