Aug-2022 Databricks-Certified-Professional-Data-Scientist Study Material, Preparation Guide and PDF Download
Free Databricks-Certified-Professional-Data-Scientist Certification Sample Questions with Online Practice Test
NEW QUESTION 50
Which of the following questions fall under the data science category?
- A. What happened in the last six months?
- B. Where is the problem with sales?
- C. Which is the optimal scenario for selling this product?
- D. What happens if this scenario continues?
- E. How many products were sold in the last month?
Answer: C,D
Explanation:
This question checks your understanding of the difference between BI and data science. BI already existed and the analytics team was already using it; the team now needs to learn data science techniques to solve other problems. The options may look similar at first, but if you have worked in BI or as a data scientist the distinction is easy to see. Options A, B, and E can be answered with a reporting solution: what sales happened in the last six months, where the problem was, and so on.
Options C and D require data science techniques. To find which scenarios are optimal for product sales, for example, you need to collect data and apply techniques such as optimization, predictive modeling, and statistical analysis on structured and unstructured data.
NEW QUESTION 51
Of all the smokers in a particular district, 40% prefer brand A and 60% prefer brand B. Of those smokers who prefer brand A, 30% are female, and of those who prefer brand B, 40% are female. What is the probability that a randomly selected smoker prefers brand A, given that the person selected is female?
Which of the following is the best way to solve this problem?
- A. Bayes' Theorem
- B. Binomial Distribution
- C. Poisson Distribution
- D. None of the above
Answer: A
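The numbers in the question can be plugged straight into Bayes' theorem. The sketch below is only an illustration; the function name `posterior` and the dictionary layout are assumptions made for this example, not part of the exam material.

```python
def posterior(priors, likelihoods, target):
    """Bayes' theorem: P(target | evidence) from priors and P(evidence | hypothesis)."""
    # Total probability of the evidence across all hypotheses.
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return priors[target] * likelihoods[target] / evidence

# Values taken from the question text.
priors = {"A": 0.40, "B": 0.60}                # P(brand)
p_female_given_brand = {"A": 0.30, "B": 0.40}  # P(female | brand)

p_a_given_female = posterior(priors, p_female_given_brand, "A")
print(round(p_a_given_female, 4))  # 0.12 / 0.36
```

The posterior works out to one third: 12% of all smokers are female brand-A smokers, out of 36% who are female.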
NEW QUESTION 52
Clustering is a type of unsupervised learning with the following goals:
- A. Find similarities in the training data
- B. 1 and 2
- C. Maximize a utility function
- D. Not to maximize a utility function
- E. 2 and 3
Answer: E
Explanation:
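One type of unsupervised learning is called clustering. In this type of learning, the goal is not to maximize a utility function but simply to find similarities in the training data. The correct choice is therefore the one that combines the statements "Find similarities in the training data" and "Not to maximize a utility function".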
The assumption is often that the clusters discovered will match reasonably well with an intuitive classification.
For instance, clustering individuals based on demographics might result in a clustering of the wealthy in one group and the poor in another. Clustering can be useful when there is enough data to form clusters (though this turns out to be difficult at times) and especially when additional data about members of a cluster can be used to produce further results due to dependencies in the data.
NEW QUESTION 53
Classification and regression are examples of ___________.
- A. Unsupervised learning
- B. Clustering
- C. Density estimation
- D. Supervised learning
Answer: D
Explanation:
In classification, our job is to predict what class an instance of data should fall into. Another task in machine learning is regression. Regression is the prediction of a numeric value. Most people have probably seen an example of regression with a best-fit line drawn through some data points to generalize the data points.
Classification and regression are examples of supervised learning. This set of problems is known as supervised because we're telling the algorithm what to predict.
NEW QUESTION 54
In which phase of the data analytics lifecycle do Data Scientists spend the most time in a project?
- A. Data Preparation
- B. Communicate Results
- C. Discovery
- D. Model Building
Answer: A
NEW QUESTION 55
Refer to the exhibit.
You are building a decision tree. In this exhibit, four variables are listed with their respective values of info-gain.
Based on this information, on which attribute would you expect the next split to be in the decision tree?
- A. Age
- B. Credit Score
- C. Income
- D. Gender
Answer: B
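The exhibit is not reproduced here, but the rule the question relies on is that the next split is made on the attribute with the highest information gain. The sketch below shows how that gain could be computed; the toy rows and the attribute names `credit`, `gender`, and `approved` are hypothetical and are not the values from the exhibit.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy data, invented for illustration only.
rows = [
    {"credit": "high", "gender": "F", "approved": "yes"},
    {"credit": "high", "gender": "M", "approved": "yes"},
    {"credit": "low", "gender": "F", "approved": "no"},
    {"credit": "low", "gender": "M", "approved": "no"},
]

gains = {a: info_gain(rows, a, "approved") for a in ("credit", "gender")}
best = max(gains, key=gains.get)  # attribute chosen for the next split
print(gains, best)
```

In this toy data the `credit` attribute separates the classes perfectly, so it has the larger gain and would be chosen for the split.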
NEW QUESTION 56
RMSE is a good measure of accuracy, but only to compare forecasting errors of different models for a______, as it is scale-dependent.
- A. Between Variables
- B. Particular Variable
- C. All of the above are correct
- D. Among all the variables
Answer: B
Explanation:
The RMSE serves to aggregate the magnitudes of the errors in predictions for various times into a single measure of predictive power. RMSE is a good measure of accuracy, but only to compare forecasting errors of different models for a particular variable and not between variables, as it is scale-dependent.
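Scale dependence is easy to see numerically. In the sketch below, the same hypothetical forecasts are expressed first in meters and then in centimeters; the values are invented for illustration, and the helper name `rmse` is an assumption of this example.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical measurements in meters.
actual_m = [1.0, 2.0, 3.0]
predicted_m = [1.5, 2.0, 2.5]

# The same quantities converted to centimeters.
actual_cm = [100 * a for a in actual_m]
predicted_cm = [100 * p for p in predicted_m]

rmse_m = rmse(actual_m, predicted_m)
rmse_cm = rmse(actual_cm, predicted_cm)
print(rmse_m, rmse_cm)
```

The forecasts are equally good in both cases, yet the RMSE in centimeters is 100 times larger, which is why RMSE values should only be compared for the same variable in the same units.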
NEW QUESTION 57
Which of the following best describes principal component analysis?
- A. Dimensionality reduction
- B. Regression
- C. Classification
- D. Clustering
- E. Collaborative filtering
Answer: A
NEW QUESTION 58
You are working as a data scientist in a data analytics company. You have been given a data set describing the various types of pizza available across premium food centers in a country. The data is given as numeric values such as calories, size, and sales per day. You need to group all the pizzas with similar properties. Which of the following techniques would you use?
- A. Linear Regression
- B. Association Rules
- C. Grouping
- D. Naive Bayes Classifier
- E. K-means Clustering
Answer: E
Explanation:
Using K-means clustering, you can create groups of objects based on their properties, where K is the number of groups. For each group you determine its center and then find how far each object's characteristics are from that center. If an object is near the center, it can be part of the group. Suppose we have 100 objects and need to determine 4 groups, so K=4. We determine 4 center values and then compute the distance of each object from each center.
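The assign-then-recompute loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the pizza values are hypothetical, the function name `kmeans` is an assumption, and real data would normally be standardized first so that one feature does not dominate the distance.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Basic K-means: assign each point to its nearest center, then recompute centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Index of the center with the smallest squared Euclidean distance.
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # New center = mean of the assigned points (keep the old one if a cluster is empty).
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical pizzas as (calories, size in inches), invented for illustration.
pizzas = [(250, 8), (260, 9), (255, 8), (600, 14), (620, 15), (610, 14)]
centers, clusters = kmeans(pizzas, k=2)
print(centers)
print(clusters)
```

With K=2 the small, low-calorie pizzas end up in one group and the large, high-calorie pizzas in the other.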
NEW QUESTION 59
Scenario: Suppose that Bob can decide to go to work by one of three modes of transportation: car, bus, or commuter train. Because of high traffic, if he decides to go by car, there is a 50% chance he will be late. If he goes by bus, which has special reserved lanes but is sometimes overcrowded, the probability of being late is only 20%. The commuter train is almost never late, with a probability of only 1%, but is more expensive than the bus.
Suppose that Bob is late one day, and his boss wishes to estimate the probability that he drove to work that day by car. Since he does not know which mode of transportation Bob usually uses, he gives a prior probability of 1/3 to each of the three possibilities. Which of the following methods will the boss use to estimate the probability that Bob drove to work?
- A. Linear regression
- B. Naive Bayes
- C. None of the above
- D. Random decision forests
Answer: B
Explanation:
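Bayes' theorem (also known as Bayes' rule) is a useful tool for calculating conditional probabilities. Here it combines the prior probability of each mode of transportation with the probability of being late for that mode to give the probability that Bob drove, given that he was late.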
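As a worked calculation with the numbers from the scenario (the symbol $m$ for a mode of transportation is notation introduced here, not part of the question):

$$
P(\text{car} \mid \text{late})
= \frac{P(\text{late} \mid \text{car})\,P(\text{car})}{\sum_{m} P(\text{late} \mid m)\,P(m)}
= \frac{0.5 \cdot \tfrac{1}{3}}{(0.5 + 0.2 + 0.01) \cdot \tfrac{1}{3}}
= \frac{0.5}{0.71} \approx 0.704
$$

Because the three priors are equal, they cancel, and the answer depends only on the three lateness probabilities.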
NEW QUESTION 60
Select the correct options from the following:
- A. If you've chosen supervised learning with a discrete target value like Yes/No, 1/2/3, A/B/C, or Red/Yellow/Black, then look into classification.
- B. If you're trying to predict or forecast a target value, then you need to look into supervised learning.
- C. Are you trying to fit your data into some discrete groups? If so and that's all you need, you should look into clustering.
- D. If you're not trying to predict a target value, then you need to look into unsupervised learning.
- E. If the target value can take on a number of values, say any value from 0.00 to 100.00, or -999 to 999, or +∞ to -∞, then you need to look into unsupervised learning.
Answer: A,B,C,D
Explanation:
If you're trying to predict or forecast a target value, then you need to look into supervised learning. If not, then unsupervised learning is the place you want to be. If you've chosen supervised learning, what's your target value? Is it a discrete value like Yes/No, 1/2/3, A/B/C, or Red/Yellow/Black? If so, then you want to look into classification. If the target value can take on a number of values, say any value from 0.00 to 100.00, or -999 to 999, or +∞ to -∞, then you need to look into regression, which is why option E is incorrect. If you're not trying to predict a target value, then you need to look into unsupervised learning. Are you trying to fit your data into some discrete groups? If so and that's all you need, you should look into clustering. Do you need to have some numerical estimate of how strong the fit is into each group? If you answer yes, then you probably should look into a density estimation algorithm.
NEW QUESTION 61
A data scientist wants to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate method for this project?
- A. Apriori algorithm
- B. Linear regression
- C. K-means clustering
- D. Logistic regression
Answer: D
Explanation:
Logistic regression is used widely in many fields, including the medical and social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression. Many other medical scales used to assess the severity of a patient's condition have been developed using logistic regression. Logistic regression may be used to predict whether a patient has a given disease (e.g. diabetes or coronary heart disease) based on observed characteristics of the patient, such as age, sex, body mass index, blood cholesterol level, systolic blood pressure, relative weight, blood hemoglobin level, smoking (at 3 levels), abnormal electrocardiogram, and the results of various blood tests.
Another example might be to predict whether an American voter will vote Democratic or Republican, based on age, income, sex, race, state of residence, votes in previous elections, etc. The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system, or product. It is also used in marketing applications such as predicting a customer's propensity to purchase a product or halt a subscription. In economics it can be used to predict the likelihood of a person choosing to be in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.
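What makes logistic regression suitable here is that it maps a weighted sum of the risk factors to a value between 0 and 1, which can be read as a probability. The sketch below illustrates that mapping only; the coefficients are hypothetical placeholders, not fitted to any real data, and the function names are assumptions of this example.

```python
from math import exp

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def death_probability(age, is_male, cholesterol, coef):
    """Probability from a linear combination of the three risk factors."""
    z = (
        coef["intercept"]
        + coef["age"] * age
        + coef["male"] * is_male
        + coef["cholesterol"] * cholesterol
    )
    return logistic(z)

# Hypothetical coefficients chosen only for illustration; a real model would
# estimate them from data (for example by maximum likelihood).
coef = {"intercept": -9.0, "age": 0.08, "male": 0.5, "cholesterol": 0.01}

p_low_risk = death_probability(age=40, is_male=0, cholesterol=180, coef=coef)
p_high_risk = death_probability(age=70, is_male=1, cholesterol=260, coef=coef)
print(p_low_risk, p_high_risk)
```

Linear regression applied to the same inputs could return values below 0 or above 1, which is why it is a poor fit for predicting a probability.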
NEW QUESTION 62
You have modeled a dataset with 5 independent variables called A, B, C, D, and E, which do not depend on each other. Variables A, B, and C are continuous, and variables D and E are discrete (mixed mode).
Now you have to compute the expected value of one variable, say A. Which of the following computations would you prefer?
- A. Integration
- B. Differentiation
- C. Generalization
- D. Transformation
Answer: A
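For a continuous variable, the expected value is the integral of a · f(a) over the range of the variable, where f is the probability density; for a discrete variable such as D or E the integral becomes a sum. The sketch below approximates that integral numerically. The uniform density and the helper name `expected_value` are assumptions chosen for illustration.

```python
def expected_value(density, lower, upper, steps=100_000):
    """Approximate E[A] = integral of a * f(a) da using the midpoint rule."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        a = lower + (i + 0.5) * width  # midpoint of the i-th slice
        total += a * density(a) * width
    return total

# Hypothetical density: A is uniform on [0, 10], so f(a) = 1/10.
mean_a = expected_value(lambda a: 0.1, 0.0, 10.0)
print(mean_a)
```

For this uniform density the result is the midpoint of the interval, as expected.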
NEW QUESTION 63
As a data scientist consultant at ABC Corp, you are working on a recommendation engine for learning resources for end users. Which recommender system technique benefits most from additional user preference data?
- A. Content-based filtering
- B. Logistic Regression
- C. Item-based collaborative filtering
- D. Naive Bayes classifier
Answer: C
Explanation:
Item-based similarity scales with the number of items, and user-based similarity scales with the number of users you have. If you have something like a store, you'll have a few thousand items at the most. The biggest stores at the time of writing have around 100,000 items. In the Netflix competition, there were 480,000 users and 17,700 movies. If you have a lot of users, then you'll probably want to go with item-based similarity. For most product-driven recommendation engines, the number of users outnumbers the number of items: there are more people buying items than unique items for sale.
Item-based collaborative filtering makes predictions based on users' preferences for items, so more preference data should be beneficial to this type of algorithm. Content-based filtering recommender systems use information about items or users, and not user preferences, to make recommendations. Logistic regression and a Naive Bayes classifier are not recommender system techniques.
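A minimal sketch of item-based collaborative filtering is shown below. It compares items by the ratings users gave them and predicts a missing rating as a similarity-weighted average of the user's other ratings. The ratings, user names, and item names are hypothetical, and treating a missing rating as 0 is a simplification made for this example.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict(ratings, user, item):
    """Predict user's rating for item from the user's ratings of similar items."""
    users = sorted(ratings)

    def column(name):
        # One rating per user for this item; missing ratings count as 0 (simplification).
        return [ratings[u].get(name, 0.0) for u in users]

    target = column(item)
    weighted = total = 0.0
    for other, rating in ratings[user].items():
        if other == item:
            continue
        similarity = cosine(target, column(other))
        weighted += similarity * rating
        total += abs(similarity)
    return weighted / total if total else 0.0

# Hypothetical preference data, invented for illustration.
ratings = {
    "ann": {"python_course": 5, "stats_book": 4, "sql_video": 1},
    "bob": {"python_course": 4, "stats_book": 5, "sql_video": 2},
    "cara": {"python_course": 5, "sql_video": 1},
}

estimate = predict(ratings, "cara", "stats_book")
print(estimate)
```

Every additional rating lengthens the item vectors being compared, so the item-to-item similarities, and the predictions built on them, become better informed as more preference data arrives.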
NEW QUESTION 64
The method based on principal component analysis (PCA) evaluates the features according to:
- A. The magnitude of the components of the discriminant vector
- B. The projection of the largest eigenvector of the correlation matrix on the initial dimensions
- C. The projection of the smallest eigenvector of the correlation matrix on the initial dimensions
- D. None of the above
Answer: B
Explanation:
Feature Selection:
The method based on principal component analysis (PCA) evaluates the features according to the projection of the largest eigenvector of the correlation matrix on the initial dimensions. The method based on Fisher's linear discriminant analysis evaluates them according to the magnitude of the components of the discriminant vector.
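The PCA-based ranking can be sketched as follows: build the correlation matrix, find its leading eigenvector (here with simple power iteration, one of several ways to do so), and read off the absolute value of each component as the score of the corresponding original feature. The three feature columns are hypothetical and the function names are assumptions of this example.

```python
from math import sqrt

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of feature columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    stds = [sqrt(sum((x - m) ** 2 for x in c) / n) for c, m in zip(columns, means)]
    z = [[(x - m) / s for x in c] for c, m, s in zip(columns, means, stds)]
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def leading_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply and normalize to approach the top eigenvector."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

# Hypothetical features: f2 is a multiple of f1, f3 is uncorrelated with both.
f1 = [1, 2, 3, 4, 5, 6]
f2 = [2, 4, 6, 8, 10, 12]
f3 = [1, -1, 0, 0, -1, 1]

corr = correlation_matrix([f1, f2, f3])
loadings = [abs(x) for x in leading_eigenvector(corr)]
print(loadings)
```

In this toy data the two correlated features receive equal, large scores, while the uncorrelated feature's score is close to zero.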
NEW QUESTION 65
Refer to image below
- A. Option C
- B. Option B
- C. Option A
- D. Option D
Answer: C
NEW QUESTION 66
Select the problems that can be solved using SVMs:
- A. SVMs are also useful in medical science to classify proteins with up to 90% of the compounds classified correctly
- B. Hand-written characters can be recognized using SVM
- C. SVMs are helpful in text and hypertext categorization
- D. Classification of images can also be performed using SVMs
Answer: A,B,C,D
Explanation:
SVMs can be used to solve various real world problems:
* SVMs are helpful in text and hypertext categorization as their application can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.
* Classification of images can also be performed using SVMs. Experimental results show that SVMs achieve significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback.
* SVMs are also useful in medical science to classify proteins with up to 90% of the compounds classified correctly.
* Hand-written characters can be recognized using SVMs.
NEW QUESTION 67
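If E1 and E2 are two events, how do you represent the conditional probability that E2 occurs given that E1 has occurred?
- A. P(E1)/P(E2)
- B. P(E1+E2)/P(E1)
- C. P(E2)/P(E1)
- D. P(E2)/P(E1+E2)
Answer: B
Explanation:
By definition, P(E2 | E1) = P(E1 and E2) / P(E1). Option B matches this definition when E1+E2 is read as both events occurring together.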
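The definition of conditional probability, P(E2 | E1) = P(E1 and E2) / P(E1), can be checked by direct enumeration. The sketch below uses two fair dice; the choice of events (first die even, sum at least 8) is a hypothetical example and not part of the question.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event given as a predicate on an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def e1(o):
    return o[0] % 2 == 0  # first die is even

def e2(o):
    return sum(o) >= 8    # total is at least 8

p_joint = prob(lambda o: e1(o) and e2(o))
p_e2_given_e1 = p_joint / prob(e1)          # definition of conditional probability
ratio_of_marginals = prob(e2) / prob(e1)    # NOT the conditional probability

print(p_e2_given_e1, ratio_of_marginals)
```

The two quantities differ (1/2 versus 5/6), which shows that dividing the marginal probabilities alone does not give the conditional probability.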
NEW QUESTION 68
Which of the following statements are true for the R-squared value in a regression model?
- A. R-squared never decreases upon adding more independent variables.
- B. When R-squared = 0, all the residuals are equal to 1.
- C. When R-squared = 1, all the residuals are equal to 0.
- D. R-squared can be increased by adding more variables to the model.
Answer: A,C,D
Explanation:
R-squared never decreases when more independent variables are added to the model, so it can be made higher simply by adding variables. When R-squared equals 1, the model fits the data exactly and all residuals are 0. In general, a higher R-squared value corresponds to smaller residuals.
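These properties follow from the formula R² = 1 − SS_res / SS_tot. The sketch below compares an intercept-only model with a model that adds one independent variable, using ordinary least squares; the data points are hypothetical and the helper names are assumptions of this example.

```python
def r_squared(actual, fitted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return mean_y - slope * mean_x, slope

# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

mean_y = sum(y) / len(y)
r2_intercept_only = r_squared(y, [mean_y] * len(y))   # no independent variables

intercept, slope = fit_line(x, y)
r2_with_x = r_squared(y, [intercept + slope * v for v in x])  # one variable added

r2_perfect = r_squared(y, y)  # all residuals are zero
print(r2_intercept_only, r2_with_x, r2_perfect)
```

Adding the variable cannot lower R² for a least-squares fit, because the larger model can always reproduce the smaller one by setting the new coefficient to zero.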