100+ Data Science Interview Questions and Answers for 2022

Basic Data Science Interview Questions 

Q1. What is the difference between data science and big data?

Ans. The common differences between data science and big data are –

| Big Data | Data Science |
| --- | --- |
| Large collection of data sets that cannot be stored in a traditional system | An interdisciplinary field that includes analytical aspects, statistics, data mining, machine learning, etc. |
| Popular in the fields of communication, purchase and sale of goods, financial services, and the educational sector | Common applications are digital advertising, web research, recommendation systems (Netflix, Amazon, Facebook), and speech and handwriting recognition |
| Solves problems related to data management and handling, and analyzes insights that lead to informed decision making | Uses machine learning algorithms and statistical methods to obtain accurate predictions from raw data |
| Popular tools are Hadoop, Spark, Flink, NoSQL, Hive, etc. | Popular tools are Python, R, SAS, SQL, etc. |

This question is among the basic data science interview questions and you must prepare for such questions.

Q2. How do you check for data quality?

Ans. Some of the dimensions used to check data quality are:

  • Completeness
  • Consistency
  • Uniqueness
  • Integrity
  • Conformity
  • Accuracy 
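A few of these checks can be sketched in plain Python. This is only a minimal illustration: the sample records, the field names, the allowed country codes, and the 0 to 120 age range are all hypothetical assumptions, not part of any standard.

```python
# Hypothetical survey records; None marks a missing answer.
records = [
    {"id": 1, "age": 34, "country": "IN"},
    {"id": 2, "age": None, "country": "US"},
    {"id": 2, "age": 29, "country": "us"},
    {"id": 4, "age": 210, "country": "DE"},
]

def completeness(rows, field):
    """Share of rows where the field is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)

def duplicate_ids(rows, key="id"):
    """Key values that appear more than once (uniqueness check)."""
    seen, dupes = set(), set()
    for r in rows:
        (dupes if r[key] in seen else seen).add(r[key])
    return dupes

def out_of_range(rows, field, low, high):
    """Rows whose value falls outside a plausible range (accuracy check)."""
    return [r for r in rows if r[field] is not None and not low <= r[field] <= high]

def nonconforming(rows, field, allowed):
    """Rows whose value is not in the allowed code list (conformity check)."""
    return [r for r in rows if r[field] not in allowed]

print(completeness(records, "age"))                        # 0.75
print(duplicate_ids(records))                              # {2}
print(out_of_range(records, "age", 0, 120))                # the row with age 210
print(nonconforming(records, "country", {"IN", "US", "DE"}))  # the row with "us"
```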

Q3. Suppose you are given survey data, and it has some missing data. How would you deal with missing values from that survey?

Ans. This is among the important data science interview questions. There are two main techniques for dealing with missing values – 

Debugging Techniques – A data cleaning process that evaluates the quality of the information collected and improves it, in order to avoid a flawed analysis. The most popular debugging techniques are –

  • Searching the list of values: Searching the data matrix for values that are outside the valid response range. These values can be treated as missing, or the correct value can be estimated from other variables.
  • Filtering questions: Comparing the number of responses in a filter category with the number in the category it filters. If an anomaly is observed that cannot be resolved, the value is treated as missing.
  • Checking for logical consistency: Answers that may contradict each other are checked.
  • Counting the level of representativeness: A count is made of the number of responses obtained for each variable. If the number of unanswered questions is very high, one can either assume that the non-answers are distributed like the answers, or impute the non-answers.

Imputation Technique – This technique consists of replacing the missing values with valid values or answers by estimating them. There are three types of imputation:

  • Random imputation
  • Hot Deck imputation 
  • Imputation of the mean of subclasses
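The three imputation types can be sketched as follows. The data, the function names, and the specific hot deck rule (reuse the last observed answer from the same subclass) are illustrative assumptions; real surveys typically use more careful donor selection.

```python
import random
from statistics import mean

# Hypothetical survey rows: (subclass, answer); None marks a missing answer.
rows = [("north", 10.0), ("north", None), ("north", 14.0),
        ("south", 20.0), ("south", 22.0), ("south", None)]

def impute_random(data, rng):
    """Random imputation: fill each gap with a randomly drawn observed answer."""
    observed = [v for _, v in data if v is not None]
    return [(g, v if v is not None else rng.choice(observed)) for g, v in data]

def impute_hot_deck(data):
    """Hot deck (sequential variant): reuse the last observed answer
    from the same subclass as the donor value."""
    last_seen, result = {}, []
    for g, v in data:
        if v is None:
            v = last_seen.get(g)  # stays None if no donor has been seen yet
        else:
            last_seen[g] = v
        result.append((g, v))
    return result

def impute_subclass_mean(data):
    """Fill each gap with the mean of the observed answers in its subclass."""
    groups = {}
    for g, v in data:
        if v is not None:
            groups.setdefault(g, []).append(v)
    return [(g, v if v is not None else mean(groups[g])) for g, v in data]

print(impute_random(rows, random.Random(0)))
print(impute_hot_deck(rows))
print(impute_subclass_mean(rows))
```

In this toy data the subclass means are 12.0 for "north" and 21.0 for "south", so those are the values the last function fills in.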

Q4. How would you deal with randomly missing values from a data set?

Ans. There are two forms of randomly missing values:

MCAR or Missing completely at random. This occurs when the missing values are randomly distributed across all observations.

We can check for this by partitioning the data into two parts –

  1. One set with the missing values
  2. Another set with the non-missing values. 

After we have partitioned the data, we conduct a t-test of mean difference to check if there is any difference in the sample between the two data sets.

In case the data are MCAR, we may choose a pair-wise or a list-wise deletion of missing value cases.   
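The partition-and-compare check described above can be sketched like this. The data and variable names are hypothetical, and the sketch assumes Welch's unequal-variance form of the t statistic, which is one common choice rather than the only one. It compares a fully observed variable (age) between rows where income is missing and rows where it is present.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical data: (age, income); income is sometimes missing (None).
rows = [(25, 30.0), (31, None), (38, 42.0), (44, None),
        (29, 35.0), (52, 58.0), (47, None), (35, 40.0)]

# Partition on whether the income value is missing.
ages_missing = [age for age, income in rows if income is None]
ages_present = [age for age, income in rows if income is not None]

def welch_t(a, b):
    """t statistic for the difference in means (unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

t_stat = welch_t(ages_missing, ages_present)
print(round(t_stat, 3))
```

A t statistic close to zero is consistent with MCAR for this variable; a large absolute value suggests the missingness is related to age.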

MAR or Missing at random. It is a common occurrence. Here, the missing values are not randomly distributed across all observations but are distributed within one or more sub-samples. The probability that a value is missing depends on other observed variables, not on the missing value itself. Data imputation is mainly performed to replace them.

Q5. What is Hadoop, and why should I care?

Ans. Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications running on clustered systems.

Apache Hadoop is a collection of open-source utility software that makes it easy to use a network of multiple computers to solve problems involving large amounts of data and computation. It provides a software framework for distributed storage and big data processing using the MapReduce programming model.

Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes so that each node processes its own data in parallel. This allows the data set to be processed faster and more efficiently than with a conventional supercomputing architecture.
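The MapReduce programming model mentioned above can be illustrated with a single-process word count. This is only a conceptual sketch: it does not use Hadoop's API, the "blocks" are plain strings, and the function names are hypothetical. In a real cluster the map step would run on the nodes that hold each block.

```python
from collections import defaultdict

# Hypothetical blocks of one file, as if stored on different nodes.
blocks = ["big data big models", "good data good models", "big data"]

def map_phase(block):
    """Emit (word, 1) pairs for one block; runs independently per block."""
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    """Group emitted values by key so each key reaches one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {key: sum(values) for key, values in grouped.items()}

mapped = [pair for block in blocks for pair in map_phase(block)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)  # {'big': 3, 'data': 3, 'models': 2, 'good': 2}
```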

Q6. What is ‘fsck’?

Ans. ‘fsck’ is an abbreviation for ‘file system check’. It is a command that searches for possible errors in files. In Hadoop, fsck generates a summary report that lists the overall health of the Hadoop Distributed File System (HDFS).

This is among the important data science interview questions and you must prepare for the related terminologies as well.

Q7. Which is better – good data or good models?

Ans. This might be one of the frequently asked data science interview questions.

The answer to this question is very subjective and depends on the specific case. Big companies tend to prefer good data, since it is the foundation of any successful business, and good models cannot be created without good data.

There is no right or wrong answer, so you can answer based on your own preference and experience (unless the company specifically requires one or the other).

Q8. What are Recommender Systems?

Ans. Recommender systems are a subclass of information filtering systems, used to predict how users would rate or score particular objects (movies, music, merchandise, etc.). Recommender systems filter large volumes of information based on the data provided by a user and other factors, and they take the user’s preferences and interests into account.

Recommender systems utilize algorithms that optimize the analysis of the data to build the recommendations. They ensure a high level of efficiency as they can associate elements of our consumption profiles such as purchase history, content selection, and even our hours of activity, to make accurate recommendations.

Q9. What are the different types of Recommender Systems?

Ans. There are three main types of Recommender systems.

Collaborative filtering – Collaborative filtering is a method of making automatic predictions by using the recommendations of other people. There are two types of collaborative filtering techniques –

  • User-User collaborative filtering
  • Item-Item collaborative filtering

Content-Based Filtering – Content-based filtering is based on the description of an item and a user’s choices. As the name suggests, it uses content (keywords) to describe the items, and the user profile is built to state the type of item this user likes.

[Image: Collaborative filtering and content-based filtering]

Hybrid Recommendation Systems – Hybrid Recommendation engines are a combination of diverse rating and sorting algorithms. A hybrid recommendation engine can recommend a wide range of products to consumers as per their history and preferences with precision.
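To make user-user collaborative filtering more concrete, here is a minimal sketch. The ratings, the use of cosine similarity over co-rated items, and the similarity-weighted average are all illustrative assumptions; production systems use many refinements.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ann": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 5, "B": 3, "C": 4, "D": 5},
    "cat": {"A": 1, "B": 5, "D": 2},
}

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other != user and item in their:
            sim = cosine(ratings[user], their)
            num += sim * their[item]
            den += sim
    return num / den if den else None

print(round(predict("ann", "D"), 2))
```

Because "ann" rates items the same way as "bob", the prediction for item "D" is pulled toward bob's rating of 5 rather than cat's rating of 2.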

Q10. Differentiate between wide and long data formats.

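Ans. In a wide format, each subject has a single row, and the subject’s repeated responses or measurements are placed in separate columns.

In a long format, each row holds one observation for one subject, so a subject appears in several rows, with one column naming the variable and another holding its value.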
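A small sketch of converting a wide table into a long one, using plain dictionaries. The subjects, the column names, and the "variable"/"value" field names are hypothetical choices for illustration.

```python
# Hypothetical wide table: one row per subject, one column per measurement.
wide = [
    {"subject": "s1", "week1": 70, "week2": 72},
    {"subject": "s2", "week1": 65, "week2": 64},
]

def wide_to_long(rows, id_field):
    """Turn every non-id column into its own (id, variable, value) row."""
    return [
        {id_field: row[id_field], "variable": name, "value": value}
        for row in rows
        for name, value in row.items()
        if name != id_field
    ]

long_rows = wide_to_long(wide, "subject")
for row in long_rows:
    print(row)
```

The two wide rows become four long rows, one per subject and measurement.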

Q11. What are Interpolation and Extrapolation?

Ans. Interpolation – This is the method of estimating a value that lies between known data points. It is a prediction within the range of the given data.

Extrapolation – This is the method of estimating a value that lies beyond the known data points. It is a prediction outside the range of the given data.
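The difference is easiest to see with a straight line through two known points. This sketch assumes a linear relationship, which is only one of many possible models, and the points are made up.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Hypothetical known points: (1, 10) and (3, 30).
inside = linear_estimate(1, 10, 3, 30, 2)    # x within [1, 3]: interpolation
outside = linear_estimate(1, 10, 3, 30, 5)   # x beyond 3: extrapolation
print(inside, outside)  # 20.0 50.0
```

The same formula serves both cases; extrapolation is riskier only because nothing in the data confirms that the line continues past the last known point.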

Q12. How much data is enough to get a valid outcome?

Ans. Every business is different and is measured in different ways. Thus, you never have enough data and there is no single right answer. The amount of data required depends on the methods you use and on what gives you a good chance of obtaining meaningful results.

Q13. What is the difference between ‘expected value’ and ‘average value’?

Ans. When it comes to functionality, there is no difference between the two: both are means. However, they are used in different situations.

The expected value is the term used for the mean of a random variable (a probability distribution), while the average value is used for the mean of an observed sample.
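A fair six-sided die makes the distinction concrete. In this sketch the die, the sample size of 10,000 rolls, and the seed are arbitrary illustrative choices.

```python
import random
from statistics import mean

# Expected value of a fair six-sided die: sum of value * probability.
faces = [1, 2, 3, 4, 5, 6]
expected_value = sum(face * (1 / 6) for face in faces)

# Average value of an observed sample of rolls (seeded for repeatability).
rng = random.Random(0)
rolls = [rng.choice(faces) for _ in range(10_000)]
average_value = mean(rolls)

print(round(expected_value, 2))  # 3.5
print(average_value)             # close to 3.5, but depends on the sample
```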

Q14. What happens if two users access the same HDFS file at the same time?

Ans. This is a bit of a tricky question. The answer itself is not complicated, but it is easy to get confused because the programs’ reactions look similar.

When the first user has the file open for writing, the second user’s write request will be rejected, because the HDFS NameNode supports exclusive writes only. Multiple users can still read the file at the same time.

Q15. What is power analysis?

Ans. Power analysis allows the determination of the sample size required to detect an effect of a given size with a given degree of confidence.
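As a rough sketch, the required sample size for comparing two group means can be estimated with the normal approximation. The function name, the two-sided test, and the default values (5% significance, 80% power) are assumptions for illustration; exact t-based calculations give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample comparison of means
    (normal approximation; effect_size is the standardized difference)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(sample_size_per_group(0.5))  # 63
```

Smaller effects need far more data: halving the effect size roughly quadruples the required sample.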

Q16. Is it better to have too many false negatives or too many false positives?

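Ans. This is among the popularly asked data science interview questions. The answer depends on the cost of each type of error in the given context, so state your viewpoint and support it with examples. For instance, in medical screening a false negative (a missed disease) is usually costlier than a false positive, while in spam filtering a false positive (a legitimate email marked as spam) is often the worse error.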

These are some of the popular data science interview questions. Always be prepared to answer all types of data science interview questions: technical skills, interpersonal, leadership, or methodologies. If you have recently started your career in Data Science, you can always get certified to improve your skills and boost your career opportunities.

Statistics Interview Questions

Q17. What is the importance of statistics in data science?

Ans. Statistics helps data scientists get a better idea of a customer’s expectations. Using statistical methods, data scientists can acquire knowledge about consumer interest, behavior, engagement, retention, etc. It also helps to build robust data models to validate certain inferences and predictions.

Q18. What are the different statistical techniques used in data science?

Ans. There are many statistical techniques used in data science, including –

The arithmetic mean – It is a measure of the average of a set of data

Graphic display – Includes charts and graphs to visually display, analyze, clarify, and interpret numerical data through histograms, pie charts, bars, etc.

Correlation – Establishes and measures relationships between different variables

Regression – Allows identifying if the evolution of one variable affects others

Time series – It predicts future values by analyzing sequences of past values

Data mining – Together with other Big Data techniques, it processes large volumes of data

Sentiment analysis – It determines the attitude of specific agents or people towards an issue, often using data from social networks

Semantic analysis – It helps to extract knowledge from large amounts of text

A/B testing – It determines which of two variants works best, using randomized experiments

Machine learning – It uses automatic learning algorithms to ensure excellent performance in the presence of big data

Q19. What is an RDBMS? Name some examples for RDBMS?

Ans.  This is among the most frequently asked data science interview questions.

A relational database management system (RDBMS) is a database management system that is based on a relational model.

Some examples of RDBMS are MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access.

Interviewers often ask such data science interview questions and you must prepare for such abbreviations.

Q20. What are a Z test, Chi-Square test, F test, and T-test?

Ans. Z test is applied for large samples. Z = (Estimated Mean – Real Mean) / √(Real Variance / n), where n is the sample size.

Chi-Square test is a statistical method for assessing the goodness of fit between a set of observed values and those expected theoretically.

F-test is used to compare the variances of two populations. F = Explained Variance / Unexplained Variance.

T-test is applied for small samples. T = (Estimated Mean – Real Mean) / √(Estimated Variance / n), where n is the sample size.
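The Z and T formulas above translate directly into code. The sample, the real mean of 50, and the real variance of 9 are hypothetical numbers chosen for illustration.

```python
from math import sqrt
from statistics import mean, variance

def z_statistic(sample, real_mean, real_variance):
    """(estimated mean - real mean) / sqrt(real variance / n)."""
    return (mean(sample) - real_mean) / sqrt(real_variance / len(sample))

def t_statistic(sample, real_mean):
    """(estimated mean - real mean) / sqrt(estimated variance / n)."""
    return (mean(sample) - real_mean) / sqrt(variance(sample) / len(sample))

# Hypothetical sample tested against a real (population) mean of 50.
sample = [52, 48, 55, 51, 49, 53, 50, 54]
print(round(z_statistic(sample, 50, 9), 3))  # 1.414
print(round(t_statistic(sample, 50), 3))     # 1.732
```

The only difference between the two is the variance in the denominator: the Z statistic uses the known real variance, while the T statistic estimates it from the sample.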

Q21. What does P-value signify about the statistical data?

Ans. The p-value is the probability for a given statistical model that, when the null hypothesis is true, the statistical summary would be the same as or more extreme than the actual observed results.

P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.

P-value <= 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.

A P-value close to 0.05 is marginal, indicating it is possible to go either way.
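As an illustration, a two-sided p-value for a Z statistic can be computed from the standard normal distribution and compared with the 0.05 threshold. The function names and the example Z values are hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a |Z| at least this large when the null hypothesis is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decision(p_value, alpha=0.05):
    """Apply the threshold rule to a p-value."""
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

for z in (1.0, 2.5):
    p = two_sided_p_value(z)
    print(z, round(p, 4), decision(p))
```

A Z statistic of 1.0 gives a p-value of about 0.32 (weak evidence), while 2.5 gives about 0.012 (strong evidence).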

Q22. Differentiate between univariate, bivariate, and multivariate analysis.

Ans. Univariate analysis is the simplest form of statistical analysis where only one variable is involved.

Bivariate analysis is where two variables are analyzed and in multivariate analysis, multiple variables are examined.

Q23. What is association analysis? Where is it used?

Ans. Association analysis is the task of uncovering relationships among data. It is used to understand how the data items are associated with each other.

Q24. What is the difference between squared error and absolute error?

Ans. Squared error is the square of the difference between the estimated value and the actual value. Averaging it over all observations gives the mean squared error, which penalizes large deviations more heavily.

Absolute error is the absolute difference between the measured or inferred value of a quantity and its actual value.
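A short worked example with made-up actual and predicted values shows how the two measures treat the same errors differently.

```python
# Hypothetical actual and predicted values.
actual = [3.0, 5.0, 2.0, 8.0]
predicted = [2.0, 5.0, 4.0, 4.0]

errors = [p - a for p, a in zip(predicted, actual)]  # [-1.0, 0.0, 2.0, -4.0]
mean_squared_error = sum(e ** 2 for e in errors) / len(errors)
mean_absolute_error = sum(abs(e) for e in errors) / len(errors)

print(mean_squared_error)   # 5.25
print(mean_absolute_error)  # 1.75
```

The single error of 4 contributes 16 of the 21 squared-error units but only 4 of the 7 absolute-error units, which is why squared error is more sensitive to outliers.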

Q25. What is an API? What are APIs used for?

Ans. API stands for Application Programming Interface and is a set of routines, protocols, and tools for building software applications.

With API, it is easier to develop software applications.

Q26. What is Collaborative filtering?

Ans. Collaborative filtering is a method of making automatic predictions by using the recommendations of other people.

Q27. Why do data scientists use combinatorics or discrete probability?

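Ans. Data scientists use them because they are useful in studying any predictive model: many models rest on counting possible outcomes and assigning probabilities to discrete events.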

Q28. What do you understand by Recall and Precision?

Ans. Precision is the fraction of retrieved instances that are relevant, while Recall is the fraction of relevant instances that are retrieved.
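Using sets of hypothetical document IDs, both measures follow directly from the definition:

```python
# Hypothetical search result: which documents matter, and which came back.
relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc2", "doc3", "doc5"}

true_positives = len(relevant & retrieved)
precision = true_positives / len(retrieved)  # 2 of 3 retrieved are relevant
recall = true_positives / len(relevant)      # 2 of 4 relevant were retrieved

print(round(precision, 3), recall)  # 0.667 0.5
```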

Q29. What is market basket analysis?

Ans. Market Basket Analysis is a modeling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.
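The "more (or less) likely" idea is commonly quantified with support, confidence, and lift. Those measures are not named in the answer above and are introduced here as standard association-rule metrics; the baskets are hypothetical.

```python
# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]

def support(items):
    """Share of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Share of baskets with the antecedent that also hold the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is overall.
    Above 1 means 'more likely', below 1 means 'less likely'."""
    return confidence(antecedent, consequent) / support(consequent)

print(round(confidence({"bread"}, {"butter"}), 2))  # 0.75
print(round(lift({"bread"}, {"butter"}), 2))        # 1.25
```

Here buyers of bread are 1.25 times as likely to buy butter as shoppers in general.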

Q30. What is the central limit theorem?

Ans. The central limit theorem states that the distribution of an average will tend to be Normal as the sample size increases, regardless of the distribution from which the average is taken except when the moments of the parent distribution do not exist.
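A small simulation can illustrate the theorem. This sketch assumes an exponential parent distribution with mean 1 (clearly non-Normal), and the seed, sample sizes, and number of trials are arbitrary choices. It reports the spread of the sample means and the share of them lying within one standard deviation of their center, which is about 0.68 for a Normal distribution.

```python
import random
from statistics import mean, stdev

def simulate_means(n, trials, rng):
    """Averages of n draws from a skewed (exponential, mean 1) distribution."""
    return [mean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(trials)]

def share_within_one_sd(values):
    """Fraction of values within one standard deviation of their mean."""
    center, spread = mean(values), stdev(values)
    return sum(abs(v - center) <= spread for v in values) / len(values)

rng = random.Random(42)
for n in (2, 30, 200):
    means = simulate_means(n, 2000, rng)
    print(n, round(stdev(means), 3), round(share_within_one_sd(means), 3))
```

As n grows, the spread of the averages shrinks roughly like 1 / √n and their distribution looks increasingly Normal.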
