
DATABRICKS-MACHINE-LEARNING-ASSOCIATE Online Practice Questions and Answers

Question 4

Which of the following hyperparameter optimization methods automatically makes informed selections of hyperparameter values based on previous trials for each iterative model evaluation?

A. Random Search

B. Halving Random Search

C. Tree of Parzen Estimators

D. Grid Search


Correct Answer: C

Tree of Parzen Estimators (TPE) is a sequential model-based optimization algorithm that selects hyperparameter values based on the outcomes of previous trials. It models the probability density of good and bad hyperparameter values and makes informed decisions about which hyperparameters to try next. This approach contrasts with methods like random search and grid search, which do not use information from previous trials to guide the search process.

References:

Hyperopt and TPE
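
To make this concrete, below is a minimal sketch of TPE-driven tuning with Hyperopt; the objective function and search space are illustrative stand-ins, not part of the question.

```python
# Minimal Hyperopt/TPE sketch (the objective and search space are made up).
from hyperopt import fmin, tpe, hp, Trials

def objective(params):
    # In practice this would train a model and return a validation loss.
    return (params["x"] - 3) ** 2

search_space = {"x": hp.uniform("x", -10, 10)}

trials = Trials()
best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,  # TPE: uses the results of past trials to pick the next values
    max_evals=50,
    trials=trials,
)
print(best)  # best hyperparameter values found
```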

Question 5

A data scientist is working with a feature set with the following schema:

The customer_id column is the primary key in the feature set. Each of the columns in the feature set has missing values. They want to replace the missing values by imputing a common value for each feature.

Which of the following lists all of the columns in the feature set that need to be imputed using the most common value of the column?

A. customer_id, loyalty_tier

B. loyalty_tier

C. units

D. spend

E. customer_id


Correct Answer: B

For the feature set schema provided, the columns that need to be imputed using the most common value (mode) are typically the categorical columns. In this case, loyalty_tier is the only categorical column that should be imputed using the most common value. customer_id is a unique identifier and should not be imputed, while spend and units are numerical columns that should typically be imputed using the mean or median values, not the mode.

References:

Databricks documentation on missing value imputation: Handling Missing Data
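
As an illustration, here is a minimal pandas sketch of this strategy; the column names follow the question, but the data values are made up.

```python
# Mode imputation for the categorical column, median for the numeric ones
# (toy data; only the column names come from the question).
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],                       # primary key: never imputed
    "spend": [10.0, None, 30.0, 50.0],                 # numeric
    "units": [1.0, 2.0, None, 4.0],                    # numeric
    "loyalty_tier": ["gold", None, "silver", "gold"],  # categorical
})

# Categorical column: impute with the most common value (mode).
df["loyalty_tier"] = df["loyalty_tier"].fillna(df["loyalty_tier"].mode()[0])

# Numeric columns: impute with the median (the mean is also common).
for col in ["spend", "units"]:
    df[col] = df[col].fillna(df[col].median())
```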

Question 6

A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:

Hyperparameter 1: [2, 5, 10]
Hyperparameter 2: [50, 100]

Which of the following represents the number of machine learning models that can be trained in parallel during this process?

A. 3

B. 5

C. 6

D. 18


Correct Answer: D

To determine the number of machine learning models that can be trained in parallel, we need to calculate the total number of combinations of hyperparameters. The given hyperparameter grid includes:

Hyperparameter 1: [2, 5, 10] (3 values)

Hyperparameter 2: [50, 100] (2 values)

The total number of hyperparameter combinations is the product of the number of values for each hyperparameter: 3 × 2 = 6. With 3-fold cross-validation, each combination is evaluated 3 times, once per fold, so the total number of models trained is 6 × 3 = 18. Because each fold of each combination is an independent training run, all 18 of these models can be trained in parallel.

References:

Databricks documentation on hyperparameter tuning: Hyperparameter Tuning
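
For illustration, here is a scikit-learn sketch of the same setup; the estimator and parameter names are assumptions, since the question does not name them.

```python
# 3-fold grid search over a 3 x 2 hyperparameter grid: 18 independent fits.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

param_grid = {
    "max_depth": [2, 5, 10],    # Hyperparameter 1: 3 values
    "n_estimators": [50, 100],  # Hyperparameter 2: 2 values
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,       # 3-fold cross-validation
    n_jobs=-1,  # the 6 x 3 = 18 independent fits can run in parallel
)
search.fit(X, y)  # trains 18 models in total
```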

Question 7

A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model.

Which of the following classification metrics should be used to evaluate the model?

A. RMSE

B. Precision

C. Area under the receiver operating characteristic curve

D. Accuracy

E. Recall


Correct Answer: E

When the goal is to maximize the identification of positive cases in a classification task, the metric of interest is Recall. Recall, also known as sensitivity, measures the proportion of actual positives that are correctly identified by the model (i.e., the true positive rate). It is crucial in scenarios where missing a positive case (a false negative) has serious implications, such as medical diagnostics. The other metrics, such as Precision, RMSE, and Accuracy, measure different aspects of performance and are not specifically focused on maximizing the detection of positive cases.

References:

Classification Metrics in Machine Learning (Understanding Recall).
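
As a quick illustration, recall can be computed with scikit-learn on toy labels (the values below are made up).

```python
# recall = true positives / (true positives + false negatives)
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

print(recall_score(y_true, y_pred))  # 3 TP / (3 TP + 1 FN) = 0.75
```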

Question 8

A machine learning engineer has grown tired of needing to install the MLflow Python library on each of their clusters. They ask a senior machine learning engineer how their notebooks can load the MLflow library without installing it each time. The senior machine learning engineer suggests that they use Databricks Runtime for Machine Learning.

Which of the following approaches describes how the machine learning engineer can begin using Databricks Runtime for Machine Learning?

A. They can add a line enabling Databricks Runtime ML in their init script when creating their clusters.

B. They can check the Databricks Runtime ML box when creating their clusters.

C. They can select a Databricks Runtime ML version from the Databricks Runtime Version dropdown when creating their clusters.

D. They can set the runtime-version variable in their Spark session to "ml".


Correct Answer: C

The Databricks Runtime for Machine Learning includes pre-installed packages and libraries essential for machine learning and deep learning, including MLflow. To use it, the machine learning engineer can simply select an appropriate Databricks Runtime ML version from the "Databricks Runtime Version" dropdown menu while creating their cluster. This selection ensures that all necessary machine learning libraries, including MLflow, are pre-installed and ready for use, avoiding the need to install them manually each time.

References:

Databricks documentation on creating clusters:

https://docs.databricks.com/clusters/create.html

Question 9

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

A. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

B. One-hot encoding is dependent on the target variable's values, which differ for each application.

C. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.

D. One-hot encoding is not a common strategy for representing categorical feature variables numerically.


Correct Answer: A

The suggestion not to one-hot encode categorical feature variables within the feature repository is justified because one-hot encoding can be problematic for some machine learning algorithms. Specifically, one-hot encoding increases the dimensionality of the data, which can be computationally expensive and may lead to issues such as multicollinearity and overfitting. Additionally, some algorithms, such as tree-based methods, can handle categorical variables directly without requiring one-hot encoding.

References:

Databricks documentation on feature engineering: Feature Engineering
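
A toy illustration of the dimensionality increase, using a hypothetical country column:

```python
# One-hot encoding turns one categorical column into one column per category.
import pandas as pd

df = pd.DataFrame({"country": ["US", "DE", "FR", "JP", "US"]})
encoded = pd.get_dummies(df, columns=["country"])
print(df.shape, "->", encoded.shape)  # (5, 1) -> (5, 4)
```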

Question 10

A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:

They have written the following incomplete code block to use predict to score each record of Spark DataFrame spark_df:

Which of the following lines of code can be used to complete the code block to successfully complete the task?

A. predict(*spark_df.columns)

B. mapInPandas(predict)

C. predict(Iterator(spark_df))

D. mapInPandas(predict(spark_df.columns))

E. predict(spark_df.columns)


Correct Answer: B

To apply the Pandas UDF predict to each record of a Spark DataFrame, you use the mapInPandas method. This method allows the Pandas UDF to operate on partitions of the DataFrame as pandas DataFrames, applying the specified function (predict in this case) to each partition. The correct code completion is simply mapInPandas(predict), which specifies the UDF to use without additional arguments or incorrect function calls.

References:

PySpark DataFrame documentation (Using mapInPandas with UDFs).
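
Below is a minimal PySpark sketch of the pattern; the column names, model logic, and output schema are illustrative assumptions, since the question's predict function is not shown.

```python
# Scoring a Spark DataFrame with a pandas function via mapInPandas
# (the "model" here is a stand-in for a real single-node model).
from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["f1", "f2"])

def predict(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Load the model once per task here, then score each pandas batch.
    for batch in batches:
        batch["prediction"] = batch["f1"] + batch["f2"]  # stand-in for model.predict
        yield batch

# mapInPandas applies predict to each partition as a pandas DataFrame;
# the output schema must be declared.
scored = spark_df.mapInPandas(predict, schema="f1 double, f2 double, prediction double")
scored.show()
```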

Question 11

A machine learning engineering team has a Job with three successive tasks. Each task runs a single notebook. The team has been alerted that the Job has failed in its latest run.

Which of the following approaches can the team use to identify which task is the cause of the failure?

A. Run each notebook interactively

B. Review the matrix view in the Job's runs

C. Migrate the Job to a Delta Live Tables pipeline

D. Change each Task's setting to use a dedicated cluster


Correct Answer: B

To identify which task is causing the failure in the job, the team should review the matrix view in the Job's runs. The matrix view provides a clear and detailed overview of each task's status, allowing the team to quickly identify which task failed. This approach is more efficient than running each notebook interactively, as it provides immediate insight into the job's execution flow and any issues that occurred during the run.

References:

Databricks documentation on Jobs: Jobs in Databricks

Question 12

A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.

In which situation will the machine learning engineer be correct?

A. When the new solution requires if-else logic determining which model to use to compute each prediction

B. When the new solution's models have an average latency that is larger than the size of the original model

C. When the new solution requires the use of fewer feature variables than the original model

D. When the new solution requires that each model computes a prediction for every record

E. When the new solution's models have an average size that is larger than the size of the original model


Correct Answer: D

If the new solution requires that each of the three models computes a prediction for every record, the time efficiency during inference will be reduced. This is because the inference process now involves running multiple models instead of a single model, thereby increasing the overall computation time for each record. When inference must be done by multiple models for each record, the latencies accumulate, making the process less time efficient than using a single model.

References:

Model Ensemble Techniques
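
A back-of-the-envelope sketch of the latency accumulation (the numbers are made up):

```python
# If every record must be scored by all three models, per-record latency
# is roughly the sum of the individual model latencies.
latency_original = 5.0           # ms, single-model solution (hypothetical)
latencies_new = [5.0, 5.0, 5.0]  # ms, three models with similar latency

print(sum(latencies_new) / latency_original)  # ~3x slower per record
```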

Question 13

A data scientist is attempting to tune a logistic regression model logistic using scikit-learn. They want to specify a search space for two hyperparameters and let the tuning process randomly select values for each evaluation.

They attempt to run the following code block, but it does not accomplish the desired task:

Which of the following changes can the data scientist make to accomplish the task?

A. Replace the GridSearchCV operation with RandomizedSearchCV

B. Replace the GridSearchCV operation with cross_validate

C. Replace the GridSearchCV operation with ParameterGrid

D. Replace the random_state=0 argument with random_state=1

E. Replace the penalty=['l2', 'l1'] argument with penalty=uniform('l2', 'l1')


Correct Answer: A

The data scientist wants to specify a search space for hyperparameters and let the tuning process randomly select values. GridSearchCV systematically tries every combination of the provided hyperparameter values, which can be computationally expensive and time-consuming. RandomizedSearchCV, on the other hand, samples hyperparameters from a distribution for a fixed number of iterations. This approach is usually faster and can still find very good parameters, especially when the search space is large or includes distributions.

References:

Scikit-learn documentation on hyperparameter tuning: https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-optimization
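
Here is a minimal scikit-learn sketch of the fix; the search-space values are illustrative assumptions.

```python
# Random search over a distribution for C and a categorical penalty choice.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

param_distributions = {
    "C": loguniform(1e-3, 1e2),  # sampled from a distribution, not enumerated
    "penalty": ["l2", "l1"],     # categorical values are sampled uniformly
}

search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear"),  # liblinear supports both penalties
    param_distributions,
    n_iter=20,      # number of random samples from the search space
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```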

Exam Code: DATABRICKS-MACHINE-LEARNING-ASSOCIATE
Exam Name: Databricks Certified Machine Learning Associate
Last Update: Jun 13, 2025
Questions: 74
