Balaram Panigrahy: Data Science Interview Questions

Q21. What Are Confounding Variables?

Ans. In statistics, a confounder is a variable that influences both the dependent variable and the independent variable.

For example, if you are researching whether a lack of exercise leads to weight gain, then lack of exercise is the independent variable and weight gain is the dependent variable.

A confounding variable here would be any other variable that affects both of these variables, such as the age of the subject.
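The effect of a confounder can be illustrated with a small simulation. This is only a sketch on invented data: it assumes that age lowers exercise and raises weight, and that exercise has no direct effect on weight at all. The helper `pearson` and all coefficients are hypothetical choices made for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
ages, exercise, weight = [], [], []
for _ in range(5000):
    age = rng.choice([25, 45, 65])
    # Assumption of this toy model: age drives both variables,
    # and exercise never enters the formula for weight.
    exercise.append(10 - 0.1 * age + rng.gauss(0, 1))
    weight.append(60 + 0.3 * age + rng.gauss(0, 3))
    ages.append(age)

# Ignoring age, exercise and weight look strongly related.
overall = pearson(exercise, weight)

# Holding age fixed (stratifying), the relationship largely disappears.
within = {
    a: pearson([e for e, g in zip(exercise, ages) if g == a],
               [w for w, g in zip(weight, ages) if g == a])
    for a in (25, 45, 65)
}

print("overall correlation:", round(overall, 2))
print("within age groups:", {a: round(r, 2) for a, r in within.items()})
```

In this toy setup the overall correlation is clearly negative while the within-group correlations sit near zero, which is the signature of a confounded relationship.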

Q22. What Are the Types of Biases That Can Occur During Sampling?

Ans.

• Selection bias

• Undercoverage bias

• Survivorship bias

Q23. What is Survivorship Bias?

Ans. It is the logical error of focusing on the aspects that survived some process while casually overlooking those that did not, because of their lack of prominence. This can lead to wrong conclusions in numerous ways.

Q24. What is Selection Bias?

Ans. Selection bias occurs when the sample obtained is not representative of the population intended to be analysed.
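A minimal sketch of how a non-representative sample distorts an estimate. The population of incomes is synthetic, and the assumption that the survey can only reach the top-earning half is made up purely for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 20,000 incomes in arbitrary units.
population = [rng.lognormvariate(10, 0.5) for _ in range(20000)]
population_mean = statistics.mean(population)

# Representative sample: every member is equally likely to be chosen.
random_sample = rng.sample(population, 500)

# Biased sample: assume only the top-earning half can be reached.
reachable = sorted(population)[len(population) // 2:]
biased_sample = rng.sample(reachable, 500)

print("population mean:   ", round(population_mean))
print("random sample mean:", round(statistics.mean(random_sample)))
print("biased sample mean:", round(statistics.mean(biased_sample)))
```

The random sample lands close to the population mean, while the biased sample overestimates it, no matter how many people are surveyed from the reachable half.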

Q25. Explain how a ROC curve works.

Ans. The ROC curve is a graphical representation of the true positive rate plotted against the false positive rate at various classification thresholds. It is often used as a proxy for the trade-off between sensitivity (the true positive rate) and the false positive rate (1 − specificity).
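To make the thresholding concrete, here is a small sketch that computes the points of a ROC curve by hand. The labels and scores are invented, and the function names `roc_points` and `auc` are hypothetical helpers, not a library API.

```python
def roc_points(labels, scores):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    labels: 1 for positive, 0 for negative.
    scores: higher means the model considers the item more likely positive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Threshold above every score: nothing is predicted positive.
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Invented example: six items with true labels and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print("AUC =", round(auc(curve), 3))
```

Each printed pair is one point on the curve. Lowering the threshold can only move the curve up or to the right, which is why it runs from (0, 0) to (1, 1).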

Q26. What is TF/IDF vectorization?

Ans. TF–IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

The TF–IDF value increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
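The weighting described above can be sketched in a few lines. This assumes one common variant, tf = count / document length and idf = log(N / document frequency); real libraries usually apply smoothing or normalization on top, so their numbers will differ. The function name `tf_idf` and the three sample sentences are illustrative only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat on the mat", "the dog chased the cat", "the bird sang"]
scores = tf_idf(docs)

# Terms of the first document, most important first.
for term, weight in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}  {weight:.3f}")
```

A word such as "the", which occurs in every document, gets a weight of zero, while "mat", which occurs in only one document, is ranked above "cat", which occurs in two.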

Q27. Why do we generally use the Softmax non-linearity as the last operation in a network?

Ans. It is because it takes in a vector of real numbers and returns a probability distribution. Its definition is as follows. Let x be a vector of real numbers (positive or negative; there are no constraints).

Then the i-th component of Softmax(x) is

$$\mathrm{Softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

It should be clear that the output is a probability distribution: each element is non-negative and the sum over all components is 1.
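A minimal sketch of softmax in plain Python, assuming the standard definition exp(x_i) divided by the sum of exp(x_j) over all components. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not mentioned above; it leaves the result unchanged.

```python
import math

def softmax(x):
    """Return the softmax of a list of real numbers as a list of probabilities."""
    # Shifting by the maximum does not change the result
    # but avoids overflow in exp() for large inputs.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -1.0])
print(probs)       # larger inputs receive larger probabilities
print(sum(probs))  # the components add up to 1 (up to rounding)
```

Negative inputs are fine: every output is still positive, and the ordering of the inputs is preserved in the outputs.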

Q28. Python or R – Which one would you prefer for text analytics?

Ans. We would prefer Python for the following reasons:

• Python has the pandas library, which provides easy-to-use data structures and high-performance data analysis tools.

• R is better suited to general machine learning than to text analysis specifically.

• Python generally performs faster for most types of text analytics.

Q29. How does data cleaning play a vital role in the analysis?

Ans. Data cleaning can help in analysis because:

• Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists can work with.
• Data cleaning helps to increase the accuracy of the model in machine learning.
• It is a cumbersome process because, as the number of data sources increases, the time taken to clean the data grows rapidly with both the number of sources and the volume of data they generate.
• Cleaning alone might take up to 80% of the time, making it a critical part of the analysis task.

Q30. Differentiate between univariate, bivariate and multivariate analysis.

Ans. Univariate, bivariate, and multivariate analyses are descriptive statistical analysis techniques that are differentiated by the number of variables involved at a given point in time. For example, a pie chart of sales by territory involves only one variable, so the analysis can be referred to as univariate analysis.

Bivariate analysis attempts to understand the relationship between two variables at a time, as in a scatterplot. For example, analyzing the volume of sales against spending can be considered an example of bivariate analysis.

Multivariate analysis deals with the study of more than two variables to understand the effect of those variables on the responses.
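The three kinds of analysis can be shown side by side on one toy dataset. Everything here is invented for illustration: the five territories, the `spend` and `price` columns, and the noise-free rule used to build `sales`. The multivariate step is a hand-rolled least-squares fit with two predictors, chosen only to keep the sketch free of external libraries.

```python
import statistics

# Hypothetical toy data for five sales territories.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]   # marketing spend
price = [3.0, 1.0, 4.0, 1.0, 5.0]   # unit price
# Response built without noise so the fitted effects are easy to check.
sales = [50 + 2 * s - 3 * p for s, p in zip(spend, price)]

def cov(xs, ys):
    """Sum of centered cross-products (an unnormalized covariance)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

# Univariate: summarize one variable on its own.
mean_sales = statistics.mean(sales)
sd_sales = statistics.stdev(sales)

# Bivariate: relationship between two variables (Pearson correlation).
corr_spend_sales = cov(spend, sales) / (cov(spend, spend) * cov(sales, sales)) ** 0.5

# Multivariate: effect of two predictors on the response,
# solved from the 2x2 normal equations of least squares.
s11, s22, s12 = cov(spend, spend), cov(price, price), cov(spend, price)
s1y, s2y = cov(spend, sales), cov(price, sales)
det = s11 * s22 - s12 ** 2
b_spend = (s22 * s1y - s12 * s2y) / det
b_price = (s11 * s2y - s12 * s1y) / det

print("univariate   mean, sd of sales:", round(mean_sales, 2), round(sd_sales, 2))
print("bivariate    corr(spend, sales):", round(corr_spend_sales, 3))
print("multivariate effects of spend, price:", round(b_spend, 3), round(b_price, 3))
```

The bivariate correlation between spend and sales is weak here because price also moves sales; the multivariate fit recovers the separate effect of each predictor used to build the data.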

Balaram Panigrahy

Thursday, August 12, 2021

Data Science Interview Questions - Part-3
