Azure Data Lake Series: Working with JSON - Part 1

Prerequisites

Upload Custom JSON Assemblies

At the time of writing, Azure Data Lake Analytics cannot natively query JSON documents without the addition of custom assemblies. Assemblies are simply chunks of precompiled code (in this instance, DLL files) which can be uploaded to Azure Data Lake Store and referenced in U-SQL to provide extended capabilities (e.g. JSON parsing functions).

The two assemblies that we will be using are:

  • Newtonsoft.Json.dll (Documentation: Json.NET)
  • Microsoft.Analytics.Samples.Formats.dll (Documentation: GitHub)

Instructions

  1. Download and unpack assemblies.zip.
  2. Navigate to your Azure Data Lake Store resource (e.g. Azure Portal > All Resources > "Your Azure Data Lake Store")
  3. Open Data Explorer.
  4. Click Upload.
  5. Select the DLL files (Newtonsoft.Json.dll and Microsoft.Analytics.Samples.Formats.dll).
  6. Click Add selected files.

Note: At this point, the assemblies have simply been uploaded and made available on the Azure Data Lake Store. Our U-SQL script will include the lines of code needed to CREATE and REFERENCE these assemblies.

upload_assemblies.gif


Exercise #1 - Simple JSON Document

In this example, we start with an extremely basic JSON document to build a fundamental understanding of how to get going: no nested fields, no nested arrays, a single file. We will tackle progressively more complex scenarios in later exercises.

Instructions

  1. Upload exercise01.json to Azure Data Lake Store.
  2. Navigate to your Azure Data Lake Analytics resource (e.g. Azure Portal > All Resources > "Your Azure Data Lake Analytics").
  3. Click New Job.
  4. Copy and paste the code from exercise01.usql (below).
  5. Update the pasted code by replacing the text "ENTER_YOUR_ADLS_NAME" with the name of your Azure Data Lake Store account.
  6. Click Submit.

Input: exercise01.json

{
    "title": "The Godfather",
    "rating": "R",
    "genre": "Crime, Drama",
    "year": 1972
}
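Before uploading, it can help to sanity-check the document locally. The following sketch (not part of the tutorial, just an illustration using Python's standard json module) confirms the four top-level fields that the U-SQL script will extract:

```python
import json

# The same content as exercise01.json.
doc = """{
    "title": "The Godfather",
    "rating": "R",
    "genre": "Crime, Drama",
    "year": 1972
}"""

movie = json.loads(doc)

# The EXTRACT statement in exercise01.usql reads exactly these four top-level fields.
print(sorted(movie.keys()))  # ['genre', 'rating', 'title', 'year']
```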

Transformation: exercise01.usql

// A. CREATE ASSEMBLY: Register assemblies (if they do not already exist).
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM @"adl://ENTER_YOUR_ADLS_NAME.azuredatalakestore.net/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM @"adl://ENTER_YOUR_ADLS_NAME.azuredatalakestore.net/Microsoft.Analytics.Samples.Formats.dll";
 
// B. REFERENCE ASSEMBLY: Load assemblies for compile time and execution.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

// C. USING: Specify namespace to shorten function names (e.g. Microsoft.Analytics.Samples.Formats.Json.JsonExtractor)
USING Microsoft.Analytics.Samples.Formats.Json;

// 1. Initialise variables for Input (e.g. JSON) and Output (e.g. CSV).
DECLARE @InputFile string = @"adl://ENTER_YOUR_ADLS_NAME.azuredatalakestore.net/exercise01.json";
DECLARE @OutputFile string = @"adl://ENTER_YOUR_ADLS_NAME.azuredatalakestore.net/exercise01.csv";

// 2. Extract string content from JSON document (i.e. schema on read).
@json =
EXTRACT
    title string,
    rating string,
    genre string,
    year string
FROM
    @InputFile
USING new JsonExtractor();

// 3. Write values to CSV.
OUTPUT @json
TO @OutputFile
USING Outputters.Csv(outputHeader:true,quoting:true);
"title","rating","genre","year"
"The Godfather","R","Crime, Drama","1972"

Video

exercise01.gif

High Level Flow

  1. JSON document is loaded and made available as a data source on Azure Data Lake Store.
  2. U-SQL job is executed allowing us to query the JSON document via Azure Data Lake Analytics.
  3. Results of the job are saved as a CSV file back on Azure Data Lake Store.

Screen Shot 2018-01-05 at 9.27.01 pm.png

Next Steps

While this example is a good starting point, the JSON documents you encounter in the wild will likely be far more complex: nested arrays, elements at varying levels, inconsistent fields, requirements for string manipulation, and so on.

Check out Azure Data Lake Series: Working with JSON - Part 2 to see how we handle our JSON example as it evolves from containing a single movie to an array.

Video: Mastering JSON in Azure Data Lake with U-SQL