DP-200: Implementing an Azure Data Solution - Exam Prep

A collection of resources, study notes, and learning material that helped me, and can hopefully help others, prepare for and pass exam DP-200: Implementing an Azure Data Solution. Note: Passing DP-200 is one out of two steps required to become a Microsoft Certified: Azure Data Engineer, you must pass both DP-200 and DP-201.

Suggested Approach

  1. Briefly skim through and familiarise yourself with the list of Resources and Skills Measured below (Tip: In regards to the Skills Measured, refer to the latest skills outline available directly from the exam home page as the content changes from time to time. The copy in this article is as of the 31st December 2019, key phrases have been pre-highlighted to aid learning).

  2. Complete the DP-200 Microsoft Learn Collection. This is a custom learning path that will provide decent coverage across the various topics and technologies included in the exam.

  3. Re-visit the Skills Measured section below upon completing the learning path and dive into the links for any areas that were not covered by the Microsoft Learn collection and/or require a deeper understanding.

  4. Lastly, read-over the Reference Tables to consume succinct summaries of key topics and comparisons.

Resource Link
Exam Exam DP-200: Implementing an Azure Data Solution
Related Certification Microsoft Certified: Azure Data Engineer Associate
Microsoft Learn Azure Data Engineer Learning Paths
Hands-On Labs Microsoft Learning - GitHub: DP-200 Hands-On Labs
MS Learn Collection Custom Microsoft Learn Collection for DP-200

Skills Measured

Note: The content of the exam was updated on the 4th December 2019 to include Azure Data Explorer and rename references of “SQL Data Warehouse” to “Azure Synapse Analytics”. Tip: Always refer to the latest skills outline available on the exam home page in case of any recent content changes.

 

Approximate Split of Content by Section

IDSS = Implement Data Storage Solutions; M&ODS = Manage & Optimize Data Solutions; M&DDP = Manage & Develop Data Processing
 

1. Implement data storage solutions (40-45%)

Implement non-relational data stores

  • implement a solution that uses Cosmos DB[1] , Data Lake Storage Gen2[2], or Blob storage[3]
  • implement data distribution and partitions [1] [2]
  • implement a consistency model in Cosmos DB [1] [2]
  • provision a non-relational data store [1] [2]
  • provide access to data to meet security requirements [1]
  • implement for high availability, disaster recovery, and global distribution [1] [2] [3]

Implement relational data stores

  • configure elastic pools [1]
  • configure geo-replication [1]
  • provide access to data to meet security requirements [1] [2] [3]
  • implement for high availability, disaster recovery, and global distribution [1] [2] [3]
  • implement data distribution and partitions for Azure Synapse Analytics [1] [2]
  • Implement PolyBase [1] [2]

Manage data security

  • implement data masking [1] [2] [3]
  • encrypt data at rest and in motion [1] [2]

2. Manage and develop data processing (25-30%)

Develop batch processing solutions

  • develop batch processing solutions by using Data Factory and Azure Databricks [1]
  • ingest data by using PolyBase [1] [2]
  • implement the integration runtime for Data Factory [1] [2] [3]
  • create linked services and datasets [1] [2]
  • create pipelines and activities [1]
  • create and schedule triggers [1]
  • implement Azure Databricks clusters[1], notebooks[2], jobs[3], and autoscaling[4]
  • ingest data into Azure Databricks [1]

Develop streaming solutions

  • configure input and output [1] [2]
  • select the appropriate windowing functions [1] [2]
  • implement event processing by using Stream Analytics [1] [2]
  • ingest and query streaming data with Azure Data Explorer [1] [2]

3. Monitor and optimize data solutions (30-35%)

Monitor data storage

  • monitor relational and non-relational data sources
  • implement Blob storage monitoring [1] [2]
  • implement Data Lake Store monitoring [1]
  • implement SQL Database monitoring [1] [2]
  • implement Azure Synapse Analytics monitoring [1]
  • implement Cosmos DB monitoring [1] [2]
  • implement Azure Data Explorer monitoring [1] [2]
  • configure Azure Monitor alerts [1]
  • implement auditing by using Azure Log Analytics [1]

Monitor data processing

  • design and implement Data Factory monitoring [1] [2]
  • monitor Azure Databricks [1] [2]
  • monitor Stream Analytics [1] [2]
  • configure Azure Monitor alerts [1]
  • implement auditing by using Azure Log Analytics [1]

Optimize Azure data solutions

  • troubleshoot data partitioning bottlenecks [1] [2]
  • optimize Data Lake Storage [1]
  • optimize Stream Analytics [1] [2] [3]
  • optimize Azure Synapse Analytics [1] [2]
  • optimize SQL Database [1]
  • manage data life cycle [1]

Reference Tables

Cosmos DB - Consistency Levels

Level Description
Strong Strongest Consistency; Highest Latency;
Bounded Staleness Reads may lag behind writes at most k versions or t interval
Session Scoped to a client session; Guaranteed to read your own writes;
Consistent Prefix Reads never see out of order writes
Eventual Weakest Consistency; Lowest Latency; Eventual Convergence;

Cosmos DB - Entities by API

Entity SQL Cassandra MongoDB Gremlin Table
Database Database Keyspace Database Database N/A
Container Container Table Collection Graph Table
Item Document Row Document Node or Edge Item

Azure Synapse Analytics - Distributed or Replicated Tables

Type Great fit for...
Replicated Small dimension tables (<2GB)
Round Robin (default) Temporary/Staging Table
Hash Fact Tables; Large Dimension Tables

Azure SQL Database - Dynamic Data Masking

Function Description Example
Default Full masking according to the data types of the designated fields Default value (0, xxxx, 01-01-1900)
Credit Card Masking method, which exposes the last four digits of the designated fields XXXX-XXXX-XXXX-1234
Email Masking method, which exposes the first letter and replaces the domain with XXX.com aXX@XXXX.com
Random Number Masking method, which generates a random number Random number range
Custom Text Masking method, which exposes the first and last characters Custom string (prefix [padding] suffix)

Azure SQL Database - Service Tiers

Basic Standard Premium
Target Workload Development & Production Development & Production Development & Production
Uptime SLA 99.9% 99.9% 99.9%
Maximum Backup Retention 7 days 35 days 35 days
CPU Low Low, Medium, High Medium, High
IO Throughput (approximate) 1-5 IOPS per DTU 1-5 IOPS per DTU 25 IOPS per DTU
IO Latency (approximate) 5ms (read), 10ms (write) 5ms (read), 10ms (write) 2ms (read/write)
Columnstore indexing N/A S3 and above Supported
In-memory OLTP N/A N/A Supported

Azure SQL Database - General Purpose vs Hyperscale

General Purpose Hyperscale
Best for... Budget orientated; Balanced compute and storage; Auto-scaling storage size; Vertical and horizontal scaling;
Storage Type Premium Remote Storage (per instance) De-coupled storage with local SSD cache (per instance)
Database Size 5 GB - 4 TB Up to 100 TB
Log Write Throughput 1.875 MB/s per vCore (max 30 MB/s) 100 MB/s

Azure SQL Database - Security Overview

Layer Type Description
Network IP Firewall Rules Grant access to databases based on the originating IP address of each request.
Network Virtual Network Firewall Rules Only accept communications that are sent from selected subnets inside a virtual network.
Access Management SQL Authentication Authentication of a user when using a username and password.
Access Management Azure AD Authentication Leverage centrally managed identities in Azure Active Directory (Azure AD).
Authorization Row-level Security Control access to rows in a table based on the characteristics of the user/query.
Threat Protection Auditing Tracks database activities by recording events to an audit log in an Azure storage account.
Threat Protection Advanced Threat Protection Analyzing SQL Server logs to detect unusual and potentially harmful behavior.
Information Protection Transport Layer Security (TLS) Encryption-in-transit between client and server.
Information Protection Transparent Data Encryption (TDE) Encryption-at-rest using AES (Azure SQL DB encrypted by default).
Information Protection Always Encrypted Encryption-in-use (Column-level granularity; Decrypted only for processing by client).
Information Protection Dynamic Data Masking Limits sensitive data exposure by masking it to non-privileged users.
Security Management Vulnerability Assessment Discover, track, and help remediate potential database vulnerabilities.
Security Management Data Discovery & Classification Discovering, classifying, labeling, and protecting the sensitive data in your databases.
Security Management Compliance Been certified against a number of compliance standards.

Data Migration Tools and Services

Tool Description
Data Migration Assistant (DMA) Helps you upgrade to a modern data platform by detecting compatibility issues that can impact database functionality in your new version of SQL Server or Azure SQL Database.
Azure Database Migration Service (DMS) A fully managed service designed to enable seamless migrations from multiple database sources to Azure data platforms with minimal downtime (online migrations).
SQL Data Sync A service built on Azure SQL Database that lets you synchronize the data you select bi-directionally across multiple SQL databases and SQL Server instances.
Database Experimentation Assistant (DEA) Helps you evaluate a targeted version of SQL Server for a specific workload. Customers upgrading from earlier versions of SQL Server (starting with 2005) to more recent versions of SQL Server can use the analysis metrics that the tool provides.
SQL Server Migration Assistant (SSMA) A tool designed to automate database migration to SQL Server from Microsoft Access, DB2, MySQL, Oracle, and SAP ASE.

Databricks - Cluster Configurations

Standard High Concurrency
Recommended for... Single User Multiple Users
Language Support SQL, Python, R, and Scala SQL, Python, and R (not Scala)
Notebook Isolation No Yes

Azure Storage - Redundancy

Abbreviation Full Name Description
LRS Locally Redundant Storage Replicates your data three times within a single data center
ZRS Zone-Redundant Storage Replicates your data across three storage clusters in a single region.
GRS Geo-Redundant Storage Replicates your data to a secondary region. Can withstand regional outage.
RA-GRS Read-Access Geo-Redundant Storage Provides read-only access to the data in the secondary location, in addition to GRS.
GZRS Geo-Zone-Redundant Storage Replicates data across three Azure Availability Zones in two regions.
RA-GZRS Read-Access Geo-Zone-Redundant Storage Provides read-only access to the data in the secondary location, in addition to GZRS.

Azure Storage - Redundancy Scenarios

Scenario LRS ZRS GRS GZRS
Node Unavailability (within a DC) Yes Yes Yes Yes
DC Outage No Yes Yes Yes
Region Outage No No Yes Yes
Read-Access in the event of a Region Outage No No
Yes
with RA-GRS
Yes
with RA-GZRS
Durability over a given year 11 9's 12 9's 16 9's 16 9's
Note: Maximum Durability = Geo (Node > DC > Region); High Availability = ZRS/GZRS; Maximum Availability = RA-ZRS/GZRS;

Azure Data Factory - Triggers

Type Description
Schedule Runs on a wall-clock schedule (e.g. every X mins/h/d/w/m's).
Tumbling Window A series of fixed-sized, non-overlapping, and contiguous time intervals.
Event-based Runs pipelines in response to an event (e.g. Blob Created/Deleted).
Note: Only Tumbling Window triggers support advanced properties such as Trigger Dependencies, Concurrency and Retry Policy.

Azure Data Factory - Integration Runtime Types

Type Capability Network Support
Azure Data Flow, Data Movement, Activity Dispatch Public
Self-Hosted Data Movement, Activity Dispatch Public or Private
Azure-SSIS SSIS Package Execution Public or Private