Google Cloud Certified Professional Data Engineer Exam

Professional Data Engineers enable data-driven decision making by collecting, transforming, and publishing data. A Data Engineer should be able to design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on security and compliance; scalability and efficiency; reliability and fidelity; and flexibility and portability. A Data Engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models.

The Professional Data Engineer exam assesses your ability to:
Design data processing systems
Operationalize machine learning models
Ensure solution quality
Build and operationalize data processing systems

About this certification exam
Length: 2 hours
Registration fee: $200 (plus tax where applicable)
Languages: English, Japanese.
Exam format: Multiple choice and multiple select taken remotely or in person at a test center. Locate a test center near you.

Exam Delivery Method:
a. Take the online-proctored exam from a remote location
b. Take the onsite-proctored exam at a testing center

Prerequisites: None
Recommended experience: 3+ years of industry experience including 1+ years designing and managing solutions using Google Cloud.

Certification Renewal / Recertification: Candidates must recertify in order to maintain their certification status. Unless explicitly stated in the detailed exam descriptions, all Google Cloud certifications are valid for two years from the date of certification. Recertification is accomplished by retaking the exam during the recertification eligibility time period and achieving a passing score. You may attempt recertification starting 60 days prior to your certification expiration date.

Exam overview
1. Review the exam guide

The exam guide contains a complete list of topics that may be included on the exam, helping you determine if your skills align with the topics on the exam.
2. Training

Professional Data Engineer

Certification exam guide
A Professional Data Engineer enables data-driven decision-making by collecting, transforming, and publishing data. A data engineer should be able to design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on security and compliance; scalability and efficiency; reliability and fidelity; and flexibility and portability. A data engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models.
Section 1. Designing data processing systems

1.1 Selecting the appropriate storage technologies. Considerations include:
Mapping storage systems to business requirements
Data modeling
Trade-offs involving latency, throughput, transactions
Distributed systems
Schema design

1.2 Designing data pipelines. Considerations include:
Data publishing and visualization (e.g., BigQuery)
Batch and streaming data (e.g., Dataflow, Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Pub/Sub, Apache Kafka)
Online (interactive) vs. batch predictions
Job automation and orchestration (e.g., Cloud Composer)

1.3 Designing a data processing solution. Considerations include:
Choice of infrastructure
System availability and fault tolerance
Use of distributed systems
Capacity planning
Hybrid cloud and edge computing
Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)
At least once, in-order, and exactly once, etc., event processing

1.4 Migrating data warehousing and data processing. Considerations include:
Awareness of current state and how to migrate a design to a future state
Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)
Validating a migration

Section 2. Building and operationalizing data processing systems

2.1 Building and operationalizing storage systems. Considerations include:
Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Datastore, Memorystore)
Storage costs and performance
Life cycle management of data

2.2 Building and operationalizing pipelines. Considerations include:
Data cleansing
Batch and streaming
Transformation
Data acquisition and import
Integrating with new data sources

2.3 Building and operationalizing processing infrastructure. Considerations include:
Provisioning resources
Monitoring pipelines
Adjusting pipelines
Testing and quality control

Section 3. Operationalizing machine learning models

3.1 Leveraging pre-built ML models as a service. Considerations include:
ML APIs (e.g., Vision API, Speech API)
Customizing ML APIs (e.g., AutoML Vision, Auto ML text)
Conversational experiences (e.g., Dialogflow)

3.2 Deploying an ML pipeline. Considerations include:
Ingesting appropriate data
Retraining of machine learning models (AI Platform Prediction and Training, BigQuery ML, Kubeflow, Spark ML)
Continuous evaluation

3.3 Choosing the appropriate training and serving infrastructure. Considerations include:
Distributed vs. single machine
Use of edge compute
Hardware accelerators (e.g., GPU, TPU)

3.4 Measuring, monitoring, and troubleshooting machine learning models. Considerations include:
Machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics)
Impact of dependencies of machine learning models
Common sources of error (e.g., assumptions about data)

Section4. Ensuring solution quality

4.1 Designing for security and compliance. Considerations include:
Identity and access management (e.g., Cloud IAM)
Data security (encryption, key management)
Ensuring privacy (e.g., Data Loss Prevention API)
Legal compliance (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children’s Online Privacy Protection Act (COPPA), FedRAMP, General Data Protection Regulation (GDPR))

4.2 Ensuring scalability and efficiency. Considerations include:
Building and running test suites
Pipeline monitoring (e.g., Cloud Monitoring)
Assessing, troubleshooting, and improving data representations and data processing infrastructure
Resizing and autoscaling resources

4.3 Ensuring reliability and fidelity. Considerations include:
Performing data preparation and quality control (e.g., Dataprep)
Verification and monitoring
Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis)
Choosing between ACID, idempotent, eventually consistent requirements

4.4 Ensuring flexibility and portability. Considerations include:
Mapping to current and future business requirements
Designing for data and application portability (e.g., multicloud, data residency requirements)
Data staging, cataloging, and discovery

QUESTION 1
Your company built a TensorFlow neutral-network model with a large number of neurons and layers.
The model fits well for the training data. However, when tested against new data, it performs poorly.
What method can you employ to address this?

A. Threading
B. Serialization
C. Dropout Methods
D. Dimensionality Reduction

Answer: C

QUESTION 2
You are building a model to make clothing recommendations. You know a user’s fashion preference is likely to
change over time, so you build a data pipeline to stream new data back to the model as it becomes available.
How should you use this data to train the model?

A. Continuously retrain the model on just the new data.
B. Continuously retrain the model on a combination of existing data and the new data.
C. Train on the existing data while using the new data as your test set.
D. Train on the new data while using the existing data as your test set.

Answer: B

QUESTION 3
You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics.
Your design used a single database table to represent all patients and their visits, and you used self-joins to
generate reports. The server resource utilization was at 50%. Since then, the scope of the project has
expanded. The database must now store 100 times more patient records. You can no longer run the reports,
because they either take too long or they encounter errors with insufficient compute resources. How should
you adjust the database design?

A. Add capacity (memory and disk space) to the database server by the order of 200.
B. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
C. Normalize the master patient-record table into the patient table and the visits table, and create other ecessary tables to avoid self-join.
D. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs,
and use unions for consolidated reports.

Answer: C

QUESTION 4
You create an important report for your large team in Google Data Studio 360. The report uses Google
BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old.
What should you do?

A. Disable caching by editing the report settings.
B. Disable caching in BigQuery by editing table details.
C. Refresh your browser tab showing the visualizations.
D. Clear your browser history for the past hour then reload the tab showing the virtualizations.

Answer: A