Data Science and ML: A Complete Interview Guide

The constant evolution of technology means that data and information are being generated at a rate never seen before, and the volume is only rising. The demand for people skilled in analyzing, interpreting, and using this data is already high and is set to grow exponentially over the coming years. These new roles cover everything from strategy to operations to governance, so current and future demand will require more data scientists, data engineers, data strategists, and Chief Data Officers.

In this post, we will look at different sets of interview questions that can certainly help if you are planning to shift your career toward data science.

Statistics Interview Questions


1. Name and explain a few methods/techniques used in statistics for analyzing data.

Arithmetic Mean:
The arithmetic mean, also called the average, is an important technique in statistics. It is the quantity obtained by summing two or more numbers/variables and then dividing the sum by the count of those numbers/variables.

Median:
The median is another way of finding the average of a group of data points. It is the middle number of a set of numbers. There are two possibilities: the group of data points can contain an odd number of values or an even number of values.

If the group is odd, arrange the numbers from smallest to largest; the median is the one sitting exactly in the middle, with an equal count of numbers on either side of it. If the group is even, arrange the numbers in order, pick the two middle numbers, add them, and divide by 2; the result is the median of that set.

Mode:
The mode is also a type of average. It is the number that occurs most frequently in a group of numbers. Some series might not have any mode; some might have two modes, which is called a bimodal series.

In the study of statistics, the three most common 'averages' are the mean, median, and mode.

Standard Deviation (Sigma):
Standard deviation is a measure of how spread out your data is.

Regression:
Regression is an analysis technique in statistical modeling. It is a statistical process for measuring the relationships among variables; it determines the strength of the relationship between one dependent variable and a series of other changing independent variables.
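
As a quick illustration, the first three measures (and the standard deviation) can be computed with Python's standard statistics module. This is a minimal sketch; the sample data is made up:

```python
import statistics

data = [4, 8, 8, 15, 16, 23, 42]

mean = statistics.mean(data)      # sum of values divided by their count
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value
spread = statistics.pstdev(data)  # population standard deviation

print(mean, median, mode)  # → 16.571428571428573 15 8
```

Note that the mode here is 8 because it appears twice while every other value appears once.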

2. Explain the branches of statistics.

The two main branches of statistics are descriptive statistics and inferential statistics.

Descriptive Statistics: Descriptive statistics summarizes the data from a sample using indexes such as the mean or standard deviation. Its methods include displaying, organizing, and describing the data.

Inferential Statistics: Inferential statistics draws conclusions from data that are subject to random variation, such as observational errors and sampling variation.

3. List the other models that work with statistics to analyze data.

Statistics, together with data analytics, analyzes data and helps businesses make good decisions. Predictive analytics and statistics are both useful for analyzing current and historical data to make predictions about future events.

4. List the fields where statistics can be used.

Statistics can be used in many research fields. Below is a list of fields in which statistics can be used:

  • Science
  • Technology
  • Business
  • Biology
  • Computer Science
  • Chemistry
  • It aids in decision making
  • It provides comparisons
  • It explains actions that have taken place
  • It predicts future outcomes
  • It estimates unknown quantities

5. What is a linear regression in statistics?

Linear regression is one of the statistical techniques used in predictive analysis. The technique identifies the strength of the impact that independent variables have on a dependent variable.
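
A minimal sketch of simple linear regression (ordinary least squares) in pure Python, fitting a line y = intercept + slope * x; the toy data points are made up so that they lie exactly on y = 2x + 1:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]       # exactly y = 2x + 1
print(fit_line(xs, ys))     # → (1.0, 2.0)
```

Because the data is perfectly linear, the fit recovers the original coefficients exactly; with noisy real data the fitted line minimizes the squared residuals instead.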

6. What is a Sample in Statistics? List the sampling methods.

In a statistical study, a sample is a set or portion of data collected or processed from a statistical population by a structured and defined procedure; the elements within a sample are known as sample points.

Below are the four sampling methods:

  • Cluster Sampling: In the cluster sampling method, the population is divided into groups or clusters.
  • Simple Random Sampling: This sampling method follows a purely random selection.
  • Stratified Sampling: In stratified sampling, the data is divided into groups or strata.
  • Systematic Sampling: The systematic sampling method picks every kth member of the population.
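
Two of these methods are easy to sketch in a few lines of Python (the population here is just the integers 0–99, chosen for illustration):

```python
import random

population = list(range(100))

# Simple random sampling: every member has an equal chance of selection.
random.seed(42)  # fixed seed so the example is reproducible
simple_random = random.sample(population, 10)

# Systematic sampling: pick every k-th member after a random start.
k = 10
start = random.randrange(k)
systematic = population[start::k]

print(len(simple_random), len(systematic))  # → 10 10
```

Cluster and stratified sampling both start from the same idea of partitioning the population into groups; the difference is whether whole groups are sampled (cluster) or members are drawn from within every group (stratified).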

7. What is P-value? Explain it.

When we execute a hypothesis test in statistics, a p-value helps us determine the significance of our results. Hypothesis tests are used to test the validity of a claim that is made about a population. The null hypothesis is the position that there is no significant difference between the specified populations, with any observed difference being due to sampling or experimental error.
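
The p-value idea can be made concrete by simulation (a sketch, not a formal test): under the null hypothesis that a coin is fair, how often would chance alone produce a result at least as extreme as the one observed, say 60 or more heads in 100 flips?

```python
import random

random.seed(0)

observed_heads = 60  # the result we actually saw in 100 flips
trials = 10_000

# Simulate the null hypothesis: a fair coin flipped 100 times, many times over.
extreme = 0
for _ in range(trials):
    heads = sum(random.random() < 0.5 for _ in range(100))
    if heads >= observed_heads:
        extreme += 1

p_value = extreme / trials  # fraction of chance results at least this extreme
print(round(p_value, 3))    # a small value (e.g. < 0.05) casts doubt on the null
```

The exact one-sided probability of 60+ heads from a fair coin is roughly 0.03, so the simulation should land near that value; a p-value below the chosen significance level leads to rejecting the null hypothesis.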

8. What is Data Science and what is the relationship between Data science and Statistics?

Data science is simply data-driven science. It is an interdisciplinary field of automated scientific methods, algorithms, systems, and processes that extract insights and knowledge from data in any form, either structured or unstructured. It also has similarities with data mining: both extract useful information from data.

Data science includes mathematical statistics along with computer science and applications. By combining aspects of statistics, visualization, applied mathematics, and computer science, data science turns vast amounts of data into insights and knowledge.

Similarly, statistics is one of the main components of data science. Statistics is a branch of mathematics concerned with the collection, analysis, interpretation, organization, and presentation of data.

9. What is correlation and covariance in statistics?

Covariance and correlation are two mathematical concepts widely used in statistics. Both establish a relationship between and measure the dependency of two random variables. Though the two are similar in mathematical terms, they are different from each other.

Correlation: Correlation is considered the best technique for measuring and estimating the quantitative relationship between two variables. It measures how strongly two variables are related.

Covariance: Covariance is a measure that indicates the extent to which two random variables change in tandem. It is a statistical term that explains the systematic relation between a pair of random variables, wherein a change in one variable is reciprocated by a corresponding change in the other.
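
Both measures fit in a few lines of pure Python (a minimal sketch using population covariance; library routines such as numpy.cov default to the sample version, which divides by n − 1 instead of n):

```python
import math

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance rescaled by the standard deviations to the range [-1, 1]."""
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]           # perfectly linearly related to xs
print(correlation(xs, ys))  # close to 1.0: perfect positive correlation
```

The key difference shows up in the units: covariance carries the units of both variables (here it is 2.5), while correlation is unit-free and always lies between −1 and 1.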


R Interview Questions

1. Explain what R is.

R is data analysis software that is used by analysts, quants, statisticians, data scientists, and others.

2. List out some of the functions that R provides.

Some of the functions that R provides are:

  • Mean
  • Median
  • Distribution
  • Covariance
  • Regression
  • Non-linear
  • Mixed Effects
  • GLM
  • GAM

3. Explain how you can start the R Commander GUI.

Typing the command library("Rcmdr") into the R console starts the R Commander GUI.

4. In R, how you can import data?

You use R Commander to import data in R, and there are three ways through which you can enter data into it:

  • You can enter data directly via Data > New Data Set
  • Import data from a plain text (ASCII) or other files (SPSS, Minitab, etc.)
  • Read a dataset either by typing the name of the data set or selecting the data set in the dialogue box

5. Explain what the R language does NOT do.

  • Though R can easily connect to a DBMS, it is not a database itself
  • R does not come with a graphical user interface (GUIs such as R Commander are separate add-ons)
  • Though it connects to Excel/Microsoft Office easily, the R language does not provide any spreadsheet view of data

6. Explain how R commands are written.

In R, commands can be annotated anywhere in the program by prefacing a line with a # sign, which marks it as a comment. For example:

  • # subtraction
  • # division
  • # note order of operations exists

7. How can you save your data in R?

To save data in R, there are many ways, but the easiest way of doing this is

Go to Data > Active Data Set > Export Active Data Set and a dialogue box will appear. When you click OK, the dialogue box lets you save your data in the usual way.

8. Explain how you can produce correlations and covariances.

You can produce correlations with the cor() function and covariances with the cov() function.

9. Explain what t-tests are in R.

In R, the t.test() function produces a variety of t-tests. The t-test is the most common test in statistics and is used to determine whether the means of two groups are equal to each other.

10. Explain what the with() and by() functions in R are used for.

  • The with() function is similar to DATA in SAS; it applies an expression to a dataset.
  • The by() function applies a function to each level of a factor. It is similar to BY processing in SAS.

11. What are the data structures in R that are used to perform statistical analyses and create graphs?

R has data structures like:

  • Vectors
  • Matrices
  • Arrays
  • Data frames

12. Explain the general format of matrices in R.

General format is

mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE, dimnames=list(char_vector_rownames, char_vector_colnames))

13. In R, how are missing values represented?

In R, missing values are represented by NA (Not Available), while impossible values are represented by the symbol NaN (Not a Number).

14. What is transpose?

R provides various methods for reshaping data before analysis, and transposing is the simplest of them. To transpose a matrix or a data frame, the t() function is used.

15. How is data aggregated in R?

Data is aggregated in R by collapsing it using one or more BY variables. When using the aggregate() function, the BY variables should be supplied in a list.

Machine Learning

1. What is Machine Learning?

Machine Learning is an application of Artificial Intelligence that provides systems with the ability to automatically learn and improve from experience without being explicitly programmed. Also, Machine Learning focuses on the development of computer programs that can access data and use it to learn for themselves.

2. Give an example that explains Machine Learning in industries.

Robots are replacing humans in many areas. It is because robots are programmed such that they can perform tasks based on data they gather from sensors. They learn from the data and behave intelligently.

3. What are the different Algorithm techniques in Machine Learning?

The different types of algorithm techniques in Machine Learning are as follows:
• Reinforcement Learning
• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning
• Transduction
• Learning to Learn

4. What is the difference between supervised and unsupervised Machine Learning?

This is a basic Machine Learning interview question. Supervised learning requires labeled training data, while unsupervised learning does not require labeled data.
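
The distinction can be made concrete with a toy sketch: supervised data comes as (input, label) pairs, and a model learns the mapping between them. Here a nearest-centroid classifier stands in for supervised learning; the values and labels are made up:

```python
from collections import defaultdict

# Supervised: inputs AND labels are available for training.
labeled = [(1.0, "low"), (1.2, "low"), (9.8, "high"), (10.1, "high")]

# "Training": average the inputs belonging to each label.
groups = defaultdict(list)
for x, label in labeled:
    groups[label].append(x)
centroids = {label: sum(xs) / len(xs) for label, xs in groups.items()}

def predict(x):
    # Assign the label whose centroid is closest to x.
    return min(centroids, key=lambda label: abs(x - centroids[label]))

print(predict(2.0))  # → low

# Unsupervised: only inputs, no labels -- an algorithm such as clustering
# would have to discover the two groups in this data by itself.
unlabeled = [1.0, 1.2, 9.8, 10.1]
```

With only the unlabeled list, a clustering algorithm could still find the same two groups, but it could never name them "low" and "high"; that information exists only in the labels.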

5. What is the function of unsupervised Learning?

The functions of unsupervised learning are below:
• Finding clusters of the data
• Finding low-dimensional representations of the data
• Finding interesting directions in the data
• Finding interesting coordinates and correlations
• Detecting novel observations

6. What is the function of supervised Learning?

The functions of supervised learning are below:
• Classification
• Speech recognition
• Regression
• Predicting time series
• Annotating strings

7. What are the advantages of Naive Bayes?

The advantages of Naive Bayes are:
• The classifier converges more quickly than discriminative models such as logistic regression
• It requires less training data

8. What are the disadvantages of Naive Bayes?

The disadvantages of Naive Bayes are:
• Problems arise with continuous features
• It makes a very strong assumption about the shape of your data distribution
• It does not work well in cases of data scarcity

9. Why is naive Bayes so naive?

Naive Bayes is so naive because it assumes that all of the features in a dataset are equally important and independent of one another, which is rarely true in real-world data.
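
The independence assumption is visible directly in the arithmetic: class-conditional probabilities of individual features are simply multiplied together. A minimal sketch with a toy spam filter (all probabilities are made up for illustration):

```python
# P(word | class), estimated from made-up counts.
p_word_given_spam = {"free": 0.8, "meeting": 0.1}
p_word_given_ham = {"free": 0.1, "meeting": 0.7}
p_spam, p_ham = 0.5, 0.5  # prior class probabilities

def score(words, prior, likelihoods):
    s = prior
    for w in words:
        # The "naive" step: multiply as if words occur independently.
        s *= likelihoods[w]
    return s

message = ["free", "meeting"]
spam_score = score(message, p_spam, p_word_given_spam)  # 0.5 * 0.8 * 0.1
ham_score = score(message, p_ham, p_word_given_ham)     # 0.5 * 0.1 * 0.7
print("spam" if spam_score > ham_score else "ham")      # → spam
```

In reality, "free" and "meeting" are not independent given the class, but the classifier ignores that; this simplification is exactly what makes Naive Bayes fast and data-efficient.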

10. What is overfitting in Machine Learning?

This is a popular Machine Learning interview question. Overfitting in Machine Learning occurs when a statistical model describes random error or noise instead of the underlying relationship, or when a model is excessively complex.

11. What are the conditions when overfitting happens?

One of the main reasons overfitting happens is that the criteria used for training the model are not the same as the criteria used to judge the efficacy of the model.

12. How can you avoid overfitting?

We can avoid overfitting by using:
• Lots of data
• Cross-validation
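
Cross-validation can be sketched in a few lines: split the data into k folds, train on k − 1 of them, and validate on the held-out fold, so every point is used for validation exactly once. A minimal version without shuffling:

```python
def k_fold_splits(data, k):
    """Yield (train, validation) pairs; each item is held out exactly once."""
    fold_size = len(data) // k
    for i in range(k):
        val = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, val

data = list(range(10))
for train, val in k_fold_splits(data, 5):
    print(len(train), len(val))  # → 8 2 on every fold
```

Averaging a model's score across the k validation folds gives an estimate of how it will perform on unseen data, which is precisely the criterion that catches an overfitted model.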

13. What are the five popular algorithms for Machine Learning?

Below is the list of five popular algorithms of Machine Learning:
• Decision Trees
• Probabilistic networks
• Nearest Neighbor
• Support vector machines
• Neural Networks

14. What are the different use cases where Machine Learning algorithms can be used?

The different use cases where Machine Learning algorithms can be used are as follows:
• Fraud Detection
• Face detection
• Natural language processing
• Market Segmentation
• Text Categorization
• Bioinformatics

15. What are parametric models and Non-Parametric models?

Parametric models are those with a finite number of parameters; to predict new data, you only need to know the parameters of the model.

Non-parametric models are those with an unbounded number of parameters, allowing for more flexibility. To predict new data, you need to know the parameters of the model and the state of the data that has been observed.
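
The contrast can be sketched with two toy predictors (all values made up): a linear model as the parametric example, and 1-nearest-neighbour as the non-parametric one.

```python
# Parametric: a fitted line predicts using only two numbers, no matter
# how much data it was trained on.
intercept, slope = 1.0, 2.0  # the model IS these two parameters

def line_predict(x):
    return intercept + slope * x

# Non-parametric: 1-nearest-neighbour must keep the whole training set,
# so its effective "parameter count" grows with the data.
training = [(0, 1.2), (1, 2.9), (2, 5.1), (3, 7.2)]

def knn_predict(x):
    nearest = min(training, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

print(line_predict(2), knn_predict(2.1))  # → 5.0 5.1
```

Doubling the training data leaves line_predict unchanged in size, while knn_predict must store twice as many points; this is the flexibility-versus-compactness trade-off the question is probing.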

16. What are the three stages to build the hypotheses or model in Machine Learning?

This is a frequently asked Machine Learning interview question. The three stages to build hypotheses or models in Machine Learning are:
1. Model building
2. Model testing
3. Applying the model

17. What is Inductive Logic Programming in Machine Learning (ILP)?

Inductive Logic Programming (ILP) is a subfield of Machine Learning that uses logic programming to represent background knowledge and examples.

18. What is the difference between classification and regression?

The differences between classification and regression are as follows:
• Classification is about identifying group membership, while regression involves predicting a response.
• Both techniques are related to prediction.
• Classification predicts membership of a class, whereas regression predicts a value from a continuous set.
• Regression is not preferred when the model needs to return the membership of data points in specific explicit categories.

19. What is the difference between inductive Machine Learning and deductive Machine Learning?

The difference between inductive Machine Learning and deductive Machine Learning is as follows: in inductive Machine Learning, the model learns by example from a set of observed instances to draw a generalized conclusion, whereas in deductive Machine Learning, the model starts from the generalized conclusion (rules) and then applies it to draw conclusions about specific instances.

20. What are the advantages of decision trees?

The advantages of decision trees are:
• Decision trees are easy to interpret
• Nonparametric
• There are relatively few parameters to tune
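
The interpretability advantage is easy to show: a fitted decision tree is just nested if/else rules that a human can read directly. The thresholds below are illustrative, loosely inspired by the classic iris dataset rather than taken from any fitted model:

```python
# A tiny hand-written "decision tree" for three flower species.
def classify(petal_length, petal_width):
    if petal_length < 2.5:
        return "setosa"
    elif petal_width < 1.7:
        return "versicolor"
    else:
        return "virginica"

print(classify(1.4, 0.2))  # → setosa
print(classify(4.5, 1.4))  # → versicolor
```

Each prediction can be explained by tracing the branch it took, which is exactly why decision trees are often preferred when a model's decisions must be justified.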

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
