Getting Started with Azure Data Factory

Contents

  1. What is Azure Data Factory?
  2. Key Concepts
  3. Visual Authoring
  4. Supported Data Stores
  5. Demo: Blob Storage to Cosmos DB with Zero Code [Video]

What is Azure Data Factory?

Azure Data Factory (ADF) is a fully managed data integration service that enables the orchestration and automation of data movement and data transformation in the cloud. Azure Data Factory works across heterogeneous environments, enabling data-driven workflows that integrate disparate cloud and on-premises data sources.

Important to note:

  • In this article, I will be referring to Azure Data Factory V2. ADF V2 is currently in public preview. For information about pricing, head over to https://azure.microsoft.com/en-gb/pricing/details/data-factory/v2/
  • At the time of this post, Azure Data Factory is available in only three regions (East US, East US 2 and West Europe).
  • That said, Azure Data Factory does not persist any data itself. All data movement and transformation activities are handled by Integration Runtimes, which are available globally to ensure data compliance and efficiency.
  • Azure Data Factory purely provides the mechanism to create, manage and monitor data workflows.
data-factory-diagram.png

Key Concepts

Pipeline
A pipeline is a data-driven workflow: a logical grouping of activities that together perform a task (e.g. ingest and load).

Activity
An activity defines an action and falls into one of three categories: Data Movement (e.g. Copy Data), Data Transformation (e.g. HDInsight, U-SQL) or Control Flow (e.g. ForEach, Lookup, Wait).

Dataset
A dataset represents the structure of a data store (e.g. table, file, document) that is used as an input or output of an activity.

Linked Service
A linked service provides Data Factory with the information necessary to establish connectivity to an external resource (much like a connection string).

The following diagram illustrates the relationships between these core elements.

adfConcept.jpg
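
To make these relationships concrete, below is a minimal sketch of the JSON definitions behind each element: a linked service holding the connection details, a dataset pointing at that linked service, and a pipeline whose copy activity consumes the dataset. The names (BlobStorageLinkedService, InputBlobDataset, OutputBlobDataset, DemoPipeline) and the placeholder connection string are illustrative, not part of any schema.

    {
        "name": "BlobStorageLinkedService",
        "properties": {
            "type": "AzureStorage",
            "typeProperties": {
                "connectionString": {
                    "type": "SecureString",
                    "value": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
                }
            }
        }
    }

    {
        "name": "InputBlobDataset",
        "properties": {
            "type": "AzureBlob",
            "linkedServiceName": {
                "referenceName": "BlobStorageLinkedService",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "folderPath": "input",
                "format": { "type": "JsonFormat" }
            }
        }
    }

    {
        "name": "DemoPipeline",
        "properties": {
            "activities": [
                {
                    "name": "CopyFromBlob",
                    "type": "Copy",
                    "inputs": [ { "referenceName": "InputBlobDataset", "type": "DatasetReference" } ],
                    "outputs": [ { "referenceName": "OutputBlobDataset", "type": "DatasetReference" } ],
                    "typeProperties": {
                        "source": { "type": "BlobSource" },
                        "sink": { "type": "BlobSink" }
                    }
                }
            ]
        }
    }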

Visual Authoring

One of the most recent developments for Azure Data Factory is the release of Visual Tools, a low-code, drag-and-drop approach to creating, configuring, deploying and monitoring data integration pipelines. If you have used Data Factory in the past, you will be familiar with the fact that this type of capability was previously only possible programmatically, either using Azure PowerShell, a supported SDK (Python, .NET) or invoking the ADF V2 REST API.

To launch Visual Tools, navigate to an Azure Data Factory resource via the Azure Portal and click "Author & Monitor".

adfVisualTools.jpg

Supported Data Stores

At the time of this post, data stores fall into one of two categories of support.

First class: 

  • Data stores with first class support are fully configurable by adjusting property values via the GUI (i.e. Visual Tools will generate the required JSON behind the scenes; a sketch of this generated JSON follows this list).
  • First class data stores include: Azure Blob, Azure Cosmos DB, Azure Database for MySQL, Azure Data Lake Store, Amazon Redshift, Amazon S3, Azure SQL DW, Azure SQL, Azure Table, File System, HDFS, MySQL, ODBC, Oracle, Salesforce, SAP HANA, SAP BW, SQL Server
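
As a rough illustration of what Visual Tools emits for a first class store, here is a minimal Azure SQL Database linked service; the name and placeholder connection string are mine, while the type and property names follow the ADF V2 schema.

    {
        "name": "AzureSqlLinkedService",
        "properties": {
            "type": "AzureSqlDatabase",
            "typeProperties": {
                "connectionString": {
                    "type": "SecureString",
                    "value": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>"
                }
            }
        }
    }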

JSON only:

  • Data stores with JSON only support require you to manually craft and input the necessary JSON syntax (see the sketch after this list).
  • JSON only data stores include: Search Index, Cassandra, HTTP file, MongoDB, OData, Relational table, Dynamics 365, Dynamics CRM, Web table, AWS Marketplace, PostgreSQL, Concur, Couchbase, Drill, Oracle Eloqua, Google BigQuery, Greenplum, HBase, Hive, HubSpot, Apache Impala, Jira, Magento, MariaDB, Marketo, PayPal, Phoenix, Presto, QuickBooks, ServiceNow, Shopify, Spark, Square, Xero, Zoho, DB2, FTP, GE Historian, Informix, Microsoft Access, SAP Cloud for Customer
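
For a JSON only store, the equivalent definition has to be written by hand. A minimal sketch for a Cassandra linked service is shown below, assuming the Cassandra connector's standard properties (host, port, authenticationType, username, password); the name and placeholder values are illustrative.

    {
        "name": "CassandraLinkedService",
        "properties": {
            "type": "Cassandra",
            "typeProperties": {
                "host": "<host>",
                "port": 9042,
                "authenticationType": "Basic",
                "username": "<username>",
                "password": {
                    "type": "SecureString",
                    "value": "<password>"
                }
            }
        }
    }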

Demo: Blob Storage to Cosmos DB with Zero Code

In this demo we will create a pipeline that copies data from a JSON document stored in Azure Blob Storage to an Azure Cosmos DB collection. Both these data stores have first class support and are therefore fully configurable via the GUI.

adfDemo1.png

Prerequisites

  • An Azure subscription.
  • An Azure Blob Storage account (to hold the source JSON document).
  • An Azure Cosmos DB account (to hold the target collection).
  • An Azure Data Factory V2 instance.

High Level Steps

  1. Upload a JSON document to Azure Blob Storage.
  2. Create a new collection in Cosmos DB.
  3. Create, configure and publish a pipeline in Azure Data Factory.
  4. Publish, run and check the results.
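
Although no code is written by hand in this demo, Visual Tools still generates JSON behind the scenes. Below is a rough sketch of the copy activity it produces, assuming a Blob dataset named InputBlobDataset and a Cosmos DB dataset named OutputCosmosDataset (both names are illustrative).

    {
        "name": "CopyBlobToCosmosDb",
        "type": "Copy",
        "inputs": [ { "referenceName": "InputBlobDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "OutputCosmosDataset", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "BlobSource" },
            "sink": { "type": "DocumentDbCollectionSink" }
        }
    }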

Video