Google Professional Machine Learning Engineer Exam: What to Expect

The definitive guide

Oleh Lokshyn
Towards Data Science


[Images: my Google Cloud ML badge and certificate]

Disclaimer: This article does not contain any information that is not already publicly available to anybody who is preparing for the exam. This article does not contain questions, in whole or part, or answers, in whole or part, from the actual exam. When I use the term “default answer” or suggest an answer to a possible question I try my best to make sure the answer is correct based on the provided documentation or preparation materials, but I cannot guarantee that the answer is correct. Likewise, I cannot guarantee that the list of topics is either complete or fully corresponds to the actual exam questions — this is only my best effort based on the preparation materials. The goal of this article is to help you be better prepared, and I would advise against any form of cheating during the preparation or the actual exam.

Update 2020–11–28: Added an awesome “GCP ML modeling solutions diagram” at the end of the article. Check it out!

Introduction

Yesterday, 2020–11–24, I passed the Google Professional Machine Learning Engineer exam (quite a mouthful, so I'll refer to it as just "the exam" from now on). I feel obligated to share the experience with my fellow ML engineers because the road to that sacred PASSED result should not be as complicated as it is now.

I had only two weeks of preparation, but I would recommend at least one month, even for experienced engineers. In my case, two weeks were enough because:

  1. I had passed the Google Professional Data Engineer exam before, so I already knew the exam format and was familiar with the nitty-gritty details of GCP services.
  2. ML Engineering is my daily job, and that really helped a lot during the exam — I recalled the problems we faced and the solutions we applied.

Important notice: There will be no question or answer dumps — this is unfair, I don’t want to spoil your fun.

Important notice: be prepared that your preparation won't be enough to be prepared! This is also true for the Data Engineer exam: the sample questions, courses, and other preparation materials do not reflect the complexity of the actual questions! While the topics are the same, expect the real questions to touch on the limitations of the services or even to present several applicable solutions, with one being slightly more "the official way to do it". I guess this is where the requirement of 3 years of practical experience comes from.

Important notice: I suppose Google picks the questions randomly so your mileage may vary.

Exam format: 60 questions, 120 minutes. Most are single-choice questions, but there were fewer than 5 multiple-choice questions. You are not required to do any calculations, and you don't get paper for notes. There are questions with code snippets in Python.

The Feel of the Exam

I selected an onsite proctored exam at a testing center, because any small thing at home, like a connection drop or a sudden cat jump, can spoil your attempt.

When the exam started, I was caught unprepared by the very first question, so my advice here is not to judge yourself until you're done with the exam. Just keep answering: the questions are in no particular order, so you might get something really complex at the start, with easy questions to follow.

Speaking of complexity, the questions are more complex than the sample questions on average, but that's to be expected after the Data Engineer exam. I was not sure about ~10 questions and didn't know the answer to ~5 questions, so I marked around 15 questions for review. It was rather hard to keep the pace of 2 minutes per question; in the end, I had to speed up and was left with ~6 minutes to review the marked questions. My advice here is to watch your timing closely, because on this exam you can easily run out of time. Also, do not mark for review the questions you are more than 70% sure about.

I think I have figured out a useful strategy for reading the questions:

  1. Read the question in full, even if you think you already know the answer. Be sure whether the question asks for a positive statement (AI Platform does support…) or a negative one (AI Platform does not support…).
  2. Read the question again and build a mental model of the problem: supervised or unsupervised, regression, binary or multi-class classification.
  3. Read the question once more (the last time, I promise) and now look for the requirements and limitations: "is the dataset available, and where is it?", "are there custom categories?", "is this something the ML APIs can pick up?", "are you required to code the model?", "is there a rush in development/evaluation?", and so on.
  4. Read each possible answer and argue why it might or might not be suitable. Remember: you have to read all the answers before you can select one. Google loves to give several applicable and perfectly working answers, but one of them will be slightly more suitable given the requirements in the question.

And last but not least, you can sometimes figure out the answer just by comparing the answer options! If a technology is repeated in two options, it might be the right one. For example (not real answers, just an illustration), if you see:

  1. Pub/Sub, Dataflow, BigQuery
  2. Pub/Sub, Dataflow, Cloud SQL
  3. Dataproc, Cloud Storage, Cloud SQL
  4. Cloud Storage, Cloud Functions, BigQuery

You can conclude that:

  1. Pub/Sub and Dataflow each appear in two options, so they are probably the right choices.
  2. Cloud Storage also appears twice, but it's paired with different technologies each time; since it conflicts with Pub/Sub and Dataflow, pick Pub/Sub and Dataflow, because they repeat as a pair.
  3. Now you only need to choose between BigQuery and Cloud SQL.

But please, resort to this guessing only if you really don't know the answer to the question.

Read on for the list of the preparation materials.

Official Materials

The biggest problem with this exam is the scarcity of preparation materials. Let's see what Google has to offer:

  1. Get 3+ years of hands-on experience — not particularly helpful, and don’t worry if you don’t have 3 years on GCP.
  2. The list of topics on the exam. While this is a good start, it’s too high-level to be of practical use.
  3. 10 sample questions. Those are really good, so make sure you fully understand the reasoning behind the correct and wrong answers. Luckily, when you answer all the questions, they give you explanations for each option.
  4. The learning path. Take my advice here with a grain of salt, because I haven't followed this learning path. I reviewed it and concluded that it's too basic and too time-consuming; I had only two weeks. Nevertheless, if you do have a couple of weeks to spare, follow this path to be bulletproof.
  5. Certification Prep: Machine Learning Certification. This one was disappointing for several reasons: 1) no new sample questions: they just reviewed 4 of the 10 already published sample questions; 2) only generic advice like "get hands-on experience" and "read the docs". I expected much more from it, but you can still watch it because it's short and "official".
  6. MLOps on Google Cloud. This could be useful if you know how to cook it! Do not pay much attention to what the speaker has to say (I was under the impression that he was just reading Google's marketing materials), but do pay attention to what's on the slides! Make sure that you understand every product name and every diagram: why those particular services were chosen to work together.
  7. Google Cloud Documentation. I write it with a smile on my face. While this resource is wonderful and you can spend your entire life reading it, “read the documentation” is not the advice you’d like to get a month before your exam (Google, I’m looking at you!). I will provide the links to the parts you should really care about.

And that’s it.

Topics on the Exam

I'll do my best, but the list cannot be exhaustive. I will also include the default applicable answers where possible — again, not the actual exam answers, just what the documentation recommends, so beware :). The links to the relevant resources can be found in the next section.

As general advice, while reading the problem statement, try to understand whether you are dealing with a regression, binary classification, or multi-class classification problem. This will help you understand which methods and metrics are available to you.

You might expect on the exam:

  1. TensorFlow Keras API. You have to be able to understand the sequential model architecture: which layers actually define trainable parameters (like dense layers), what dropout layers are, and what convolutional layers are (see the Keras sketch after this list).
  2. TensorFlow distributed training. The general answer is that GPU training is faster than CPU training, and GPUs usually don't require any additional setup. TPUs are faster than GPUs but have their limitations. Besides, make sure you know what the replica roles mean (master, worker, parameter server, evaluator) and how many of each you can get. If you need to optimize distributed training, the default answers are: 1) use the tf.data.Dataset API for the input; 2) interleave the pipeline steps by enabling parallelism; 3) the Keras API has better support for distributed training than the Estimator API. See the MirroredStrategy sketch after this list.
  3. Know your feature engineering in TensorFlow and how to produce the following features: numeric, categorical (one-hot encoded/embedded/hashed), bucketized (one-hot encoded/embedded/hashed), and crossed (see the feature-column sketch after this list).
  4. TensorFlow Extended, or TFX. You have to know the components and how to build the pipeline out of them.
  5. AI Platform distributed training with containers. The default answer is that if you have a distributed training app, you can package each component (master, worker, parameter server) in a separate container and deploy it on AI Platform.
  6. AI Platform distributed training. This is essentially the union of the TensorFlow distributed training topics and AI Platform distributed training with containers. However, note that distributed training is not supported for models using the scikit-learn (as you may have guessed) or XGBoost environments.
  7. AI Platform Hyperparameter tuning. Might be useful to know that Bayesian optimization is used under the hood.
  8. AI Platform built-in algorithms. This is something in-between AutoML and custom code: you still have to do the data preprocessing, feature engineering, and hyperparameter tuning, but the model itself is already implemented. Be aware that built-in algorithms do not support distributed training.
  9. BigQuery ML. The default answer is that if your data is already in BigQuery and you want the output to also be there, you should use BigQuery ML for your modeling (see the BigQuery ML sketch after this list). But be aware of the limitations.
  10. Auto ML vs ML APIs. The default answer is to go for the ML APIs unless you need something custom, like detecting your own company's products in images or classifying the transcripts of support calls to your company; in that case, use AutoML.
  11. Be able to place the modeling tools on the spectrum from most managed to most customizable: ML APIs -> Auto ML -> BigQuery ML -> AI Platform Built-in algorithms -> AI Platform Training. In general, try to stick to the left part of the spectrum unless you hit the limitations of the selected technology. Thus, if you need distributed training, you can't use built-in algorithms.
  12. Recommendations AI. Read the docs at least; some practice is recommended.
  13. Deep Learning VM. Do know how to troubleshoot it! Practical experience is desirable.
  14. Explainable AI. There are just 3 feature attribution methods that you should care about (at least GCP's documentation mentions only three, there may be more): Integrated Gradients, XRAI, and Sampled Shapley. AutoML Tables also supports explanations.
  15. TensorBoard What-if tool. Used to find biases in the model.
  16. Data Labeling service allows you to request human labeling of your dataset. The default answer is that you never label the dataset yourself; you pay Google to do it.
  17. The question may call for the most time-efficient solution or have a requirement to build a solution quickly. They mean it — just pay attention to how many times Lak emphasizes this in the preparation video. In this case, you have to pick the quickest solution, the one that involves as few steps as possible. The quickest solution is usually not the most performant one.
  18. Dataflow, Pub/Sub, and data pipelines in general. Dataflow is your only real option for both batch and streaming pipelines. When you need a streaming pipeline, it will always be Pub/Sub + Dataflow. The sink storage of a streaming pipeline may differ, but you can stream into BigQuery or Bigtable. Also, remember that Cloud Storage usually means a batch pipeline. If you are asked for "real time", it means a streaming pipeline, so you should usually go with Pub/Sub + Dataflow + [BigQuery | Pub/Sub].
  19. Cloud Monitoring. You just have to know that AI Platform Training and AI Platform Prediction have built-in metric monitoring that you can view with Cloud Monitoring. You can also add your own metrics. The default answer is that your model cannot be deployed without live monitoring.
  20. Know how to set up continuous evaluation of your model. The default answer is that you can't deploy a model unless you provide continuous evaluation for it.
  21. Kubeflow and Kubeflow Pipelines. Kubeflow also has built-in Kubeflow Metadata for artifact tracking.
  22. Precision, Recall, and F1 score. You should know the following: use Precision to minimize False Positives, use Recall to minimize False Negatives, and use the F1 score to balance both. Usually (not always!), increasing Precision means decreasing Recall and vice versa (see the metrics sketch after this list).
  23. AUC ROC and AUC PR. In general, AUC ROC is preferred because it is classification-threshold invariant, scale-invariant, and class-balance-invariant. AUC PR is inferior to AUC ROC because AUC PR is class-balance-dependent, but it has its use cases.
  24. Know what transfer learning is.
  25. Know what makes a good feature: 1) related to the objective; 2) known at prediction time; 3) definition won’t change over time; 4) numeric with meaningful magnitude (not ordinal but cardinal); 5) has enough examples; 6) brings human insights to the problem.
  26. Know the hyperparameter tuning guidelines: 1) when lowering the learning rate, increase the batch size (or the number of epochs); 2) small batch sizes cause oscillation in the loss; 3) a high learning rate causes jumps in the loss.
  27. Know what canary deployment is and how it is different from A/B testing.
  28. Know how to handle missing data. If the feature is important (if not, drop it), the recommended strategy is to add a column that indicates whether the value was missing in the original column, and only then replace the missing values with the mean/mode (see the pandas sketch after this list).
  29. Know what Cloud Composer is, but it may not be the right solution to the problem.
  30. Know what Cloud Functions are, but again, they are probably not the right solution.
  31. Know what Cloud SQL is. It can be the right solution for transactional data storage. By this, I mean the storage used by apps that serve users.
  32. Not all problems require ML solutions.
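
To make some of the items above concrete, here are a few short sketches, all in Python; every dataset, column, and model name in them is invented for illustration. First, for item 1, a minimal Keras sequential model mixing the layer types mentioned above. The architecture is arbitrary; the point is to see which layers carry trainable parameters.

    import tensorflow as tf

    # A tiny image classifier, purely illustrative.
    model = tf.keras.Sequential([
        # Convolutional layers learn filters, so they carry trainable parameters.
        tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu",
                               input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),                # no trainable parameters
        tf.keras.layers.Flatten(),                     # reshaping only
        tf.keras.layers.Dense(64, activation="relu"),  # weights + biases
        tf.keras.layers.Dropout(0.5),                  # regularization only, no parameters
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()  # the "Param #" column shows which layers define parameters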
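
For item 2, a minimal sketch of synchronous data-parallel training with tf.distribute.MirroredStrategy, fed by a tf.data pipeline with the recommended prefetching (the random training data is a stand-in for a real dataset):

    import tensorflow as tf

    # Synchronous data parallelism across all local GPUs (falls back to CPU).
    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():  # model variables must be created under the strategy
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

    # tf.data input pipeline: shuffle, batch, and prefetch so that input
    # preparation overlaps with the training steps.
    features = tf.random.normal([1024, 10])
    labels = tf.random.normal([1024, 1])
    dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
               .shuffle(1024)
               .batch(64)
               .prefetch(tf.data.AUTOTUNE))

    model.fit(dataset, epochs=2)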
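
For item 3, one example of each feature type using the tf.feature_column API; the column names ("price", "color", "city") are made up:

    import tensorflow as tf

    # Plain numeric feature.
    price = tf.feature_column.numeric_column("price")

    # Bucketized: the numeric range is split into intervals, then one-hot encoded.
    price_buckets = tf.feature_column.bucketized_column(
        price, boundaries=[10.0, 50.0, 100.0])

    # Categorical with a known vocabulary, one-hot encoded via an indicator column.
    color = tf.feature_column.categorical_column_with_vocabulary_list(
        "color", ["red", "green", "blue"])
    color_onehot = tf.feature_column.indicator_column(color)

    # High-cardinality categorical: hash values into a fixed number of buckets.
    city = tf.feature_column.categorical_column_with_hash_bucket(
        "city", hash_bucket_size=1000)

    # Dense embedding instead of one-hot for a large vocabulary.
    city_embedded = tf.feature_column.embedding_column(city, dimension=8)

    # Feature cross of two raw features, hashed (crossed_column accepts feature
    # names or categorical columns, but not hashed categorical columns).
    color_x_city = tf.feature_column.indicator_column(
        tf.feature_column.crossed_column(["color", "city"], hash_bucket_size=5000))

    # The engineered columns can feed a Keras model through DenseFeatures.
    feature_layer = tf.keras.layers.DenseFeatures(
        [price_buckets, color_onehot, city_embedded, color_x_city])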
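
For item 9, a sketch of how modeling stays entirely inside BigQuery with BigQuery ML, here driven from the Python client; `mydataset` and its tables are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Train a logistic regression model without moving the data anywhere.
    client.query("""
        CREATE OR REPLACE MODEL `mydataset.churn_model`
        OPTIONS (model_type='logistic_reg', input_label_cols=['churned']) AS
        SELECT * FROM `mydataset.customer_features`
    """).result()

    # Predictions land in BigQuery as well.
    predictions = client.query("""
        SELECT * FROM ML.PREDICT(
            MODEL `mydataset.churn_model`,
            (SELECT * FROM `mydataset.new_customers`))
    """).result()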
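
For item 22, a tiny worked example of the three metrics with scikit-learn (the labels are made up):

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    # Precision = TP / (TP + FP): how many predicted positives were real.
    print("precision:", precision_score(y_true, y_pred))  # 3/4 = 0.75
    # Recall = TP / (TP + FN): how many real positives were caught.
    print("recall:", recall_score(y_true, y_pred))        # 3/4 = 0.75
    # F1 is the harmonic mean of Precision and Recall.
    print("f1:", f1_score(y_true, y_pred))                # 0.75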
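
Finally, for item 28, the missing-data recipe in pandas: add the indicator column first, impute second (the "income" column is invented):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"income": [52000.0, np.nan, 61000.0, np.nan, 48000.0]})

    # 1) Keep the fact that the value was missing as a feature of its own.
    df["income_missing"] = df["income"].isna().astype(int)

    # 2) Only then replace the missing values with the mean.
    df["income"] = df["income"].fillna(df["income"].mean())

    print(df)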

Recommended Materials

  1. Machine Learning with TensorFlow on Google Cloud Specialization on Coursera. This provides an understanding of the TensorFlow Keras API. Please do the following: 1) write the code in the labs yourself; 2) pay attention to course #4, which goes over the tf.transform module.
  2. Distributed TensorFlow training. Understand and learn distribution strategies by heart. I cannot say that practical experience is required.
  3. How to debug and optimize TensorFlow performance. Also on GPU.
  4. How to optimize the input pipeline performance with tf.data.
  5. TPU limitations. While TPUs are faster than GPUs, they are not as general-purpose and support only particular models. TPUs are also more expensive. Here is a good TPU vs GPU comparison.
  6. Know well how to optimize your TensorFlow distributed training. This is a really good watch, especially the part about parallelism in the pipelines that allow interleaving the steps.
  7. TFX guide.
  8. AI Platform training with containers.
  9. AI Platform distributed training with containers.
  10. AI Platform Hyperparameter Tuning.
  11. AI Platform Built-in algorithms.
  12. BigQuery ML models.
  13. Recommendations AI.
  14. Deep Learning VM troubleshooting.
  15. Feature attribution methods. You should always try to use XRAI for images and Integrated Gradients for everything else that's differentiable. Use Sampled Shapley only for ensembles and other non-differentiable models. Note that XRAI is based on Integrated Gradients and also can't be used on non-differentiable models.
  16. Using Explainable AI.
  17. TensorBoard What-if tool.
  18. Data Labeling Service.
  19. AI Platform Prediction Monitoring.
  20. AI Platform Training Monitoring.
  21. AI Platform continuous evaluation.
  22. Kubeflow Pipelines.
  23. Kubeflow Metadata.
  24. Precision, Recall, and ROC AUC. Don't forget to read about the difference between ROC AUC and PR AUC.
  25. Canary deployment and A/B testing.

Awesome Additions

I have compiled a flowchart and several comparison tables during my preparation and found them really useful, so enjoy!

[Diagram: GCP ML Modeling Solutions]

If you have anything to add to this diagram — please leave a comment!

[Table: TensorFlow distribution strategies]

[Table: Auto ML vs ML APIs]

Rounding Up

So, my dear learner, I hope this knowledge helps you. At least it would have helped me, had I had it when I started.

I plan to update this article when I find more useful links and learning materials.

Your feedback is appreciated; let's transform this article into a definitive guide to the Google Professional Machine Learning Engineer exam!

Wish you luck on your path,

Oleh
