Azure Data Lake gen2 & SQL Server – Getting Started with PolyBase

Imagine a scenario: your team runs SQL Server boxes, has plenty of T-SQL experience, and sometimes needs to dump query results into a data lake. Or the data is to be aged out and moved to an archival location. Or yet another scenario – interactive querying of Parquet files by joining them to local relational tables.

Starting with the SQL Server 2016 release, Microsoft delivered PolyBase – a solution for such cases. This post is about getting started with the feature and running a few demos: data ingestion and data retrieval.


So, what is PolyBase?

Let’s start the theoretical part with the official definition:

PolyBase enables your SQL Server instance to process Transact-SQL queries that read data from external data sources. SQL Server 2016 and higher can access external data in Hadoop and Azure Blob Storage.

What is PolyBase?

With the SQL Server 2019 release, the list of external data sources was expanded to Oracle, Teradata, MongoDB, and generic ODBC types. In this post, I will focus on connectivity between SQL Server and HDFS-based data lakes, like Azure Data Lake gen2.

PolyBase logically acts as a superset on top of the SQL Server Database Engine:
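To make this concrete, the snippet below sketches the typical setup: an external data source pointing at the lake, a file format, and an external table over Parquet files that can then be queried with plain T-SQL. All object names, the storage URL, and the secret are hypothetical placeholders, and the exact LOCATION prefix (abfss://, wasbs://) depends on the SQL Server version and storage type:

```sql
-- A credential for the storage account (requires an existing database master key);
-- the account name and key are placeholders.
CREATE DATABASE SCOPED CREDENTIAL AdlsCredential
WITH IDENTITY = 'mystorageaccount', SECRET = '<storage-account-key>';

-- The data lake as an external data source.
CREATE EXTERNAL DATA SOURCE DataLake
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://data@mystorageaccount.dfs.core.windows.net',
    CREDENTIAL = AdlsCredential
);

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- An external table over Parquet files in the /sales folder of the lake.
CREATE EXTERNAL TABLE dbo.SalesArchive (
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = DataLake,
    FILE_FORMAT = ParquetFormat
);

-- Interactive querying, including joins to local relational tables:
SELECT TOP 10 * FROM dbo.SalesArchive;
```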

Continue reading…

Azure Data Factory & DevOps – Setting-up Continuous Delivery Pipeline

This is the final post of a DevOps series related to Classic pipelines, and here I will touch on the well-known practices of Continuous Integration/Continuous Delivery, or CI/CD, and how they relate to Azure Data Factory. It is a continuation of one I published earlier: Azure Data Factory & DevOps – Integration with a Source Control.


Prerequisites:

  • Pre-configured development, test and production environments. They can be generated and configured using scripts from the two previous posts.
  • A source-control-enabled development environment. The development environment should already have source control integration configured. This aspect was illustrated in the previous post.

So, what does the term “CI/CD” mean for ADF development?

Before moving on to the practical part, it is worth walking through the terminology and understanding some limitations of the current implementation of Data Factory v2.

Continuous integration is a coding philosophy and set of practices that drive development teams to implement small changes and check-in code to version control repositories frequently. The technical goal of CI is to establish a consistent and automated way to build, package, and test applications.
Continuous delivery picks up where continuous integration ends. CD automates the delivery of applications to selected infrastructure environments. Most teams work with multiple environments other than the production, such as development and testing environments, and CD ensures there is an automated way to push code changes to them.

What is CI/CD? Continuous integration and continuous delivery explained

Azure Data Factory v2 currently has limited support for the Continuous Integration (CI) concept. Of course, developers can still perform frequent code check-ins. However, there is no way to configure automated builds and packaging of the data factory pipelines. This is because the collaboration branch (normally master) stores JSON definitions of objects, like pipelines, datasets, and linked services. Such objects cannot be deployed directly to the Data Factory service until they are rebuilt into an ARM template.
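In practice, for Classic pipelines, the deployable artifact is the ARM template that Data Factory generates in the adf_publish branch on every “Publish”. A release stage then pushes that template to the target factory; a minimal sketch with the Azure CLI is shown below (the resource group and factory names are hypothetical, while the file names are the defaults ADF generates):

```shell
# Deploy the ARM template generated by ADF "Publish" (adf_publish branch)
# to the test stage; names are placeholders following this series' convention.
az deployment group create \
    --resource-group "rg-adf-devops2020-tst" \
    --template-file "ARMTemplateForFactory.json" \
    --parameters "ARMTemplateParametersForFactory.json" \
    --parameters factoryName="adf-devops2020-tst"
```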

Continue reading…

Azure Data Factory & DevOps – Integration with a Source Control

A few recently published posts – Automated Deployment via Azure CLI and Post-Deployment Configuration via Azure CLI – are about getting blank environments deployed and configured automatically and consistently. It is now time to integrate the development stage with Azure DevOps. In this post, I will give a step-by-step guide on how to add a Data Factory instance to a source control system. The next one will cover Continuous Integration and Continuous Delivery, or CI/CD.


Prerequisites:

  • Blank development, test and production environments. They can be generated and configured using scripts from the two previous posts. So the whole landscape will eventually look similar to:

Why bother with source control and Azure DevOps?

The Data Factory currently supports two editing modes: Live and Git-integrated. While Live editing is an easy start and available by default, the alternative offers various advantages:

  • Source Control Integration and CI/CD:
    • Ability to track/audit changes.
    • Ability to revert changes that introduced bugs.
    • Release pipelines with automatic triggers as soon as changes are published to the development stage.
  • Partial Saves:
    • With Git integration, you can continue saving your changes incrementally, and publish to the factory only when you are ready. Git acts as a staging place for your work.
  • Collaboration and Control:
    • A code review process.
    • Only certain people in the team are allowed to “Publish” the changes to the factory and trigger an automated delivery to test/production.
  • Showing diffs:
    • Shows all resources/entities that were modified, added, or deleted since the last time you published to your factory.
Continue reading…

Azure Data Factory & DevOps – Post Deployment Configuration via Azure CLI

The previous post – Automated Deployment via Azure CLI – is an example of how to get a brand-new environment by firing a script. However, it comes up pretty blank and unconfigured. In this post, I would like to talk about the automated configuration of Azure Data Factory deployments. At least a few things can be scripted in most cases: creation of storage containers, upload of sample data, creation of secrets in a Key Vault, and granting a Data Factory access to them.


Coding a PowerShell script

Step 1. A naming convention and resource names

This block is to define the name of resources as variables that are later going to be used by configuration commands.

param([String]$EnvironmentName = "adf-devops2020",
      [String]$Stage = "dev",
      [String]$Location = "westeurope")

$OutputFormat = "table"  # other options: json | jsonc | yaml | tsv

# internal: assign resource names
$ResourceGroupName = "rg-$EnvironmentName-$Stage"
$ADFName = "adf-$EnvironmentName-$Stage"
$KeyVaultName = "kv-$EnvironmentName-$Stage"
$StorageName = "adls$EnvironmentName$Stage".Replace("-","")
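With the names in place, the configuration steps themselves become a handful of Azure CLI calls. A sketch of the remaining steps (the container and secret names are examples, and the last step assumes the factory already exists with a system-assigned managed identity):

```powershell
# create a container for the sample data
az storage container create --name "data" --account-name $StorageName --output $OutputFormat

# store the storage account key as a Key Vault secret
$StorageKey = az storage account keys list --account-name $StorageName `
    --resource-group $ResourceGroupName --query "[0].value" --output tsv
az keyvault secret set --vault-name $KeyVaultName --name "StorageAccountKey" `
    --value $StorageKey --output $OutputFormat

# grant the factory's managed identity read access to the secrets
$PrincipalId = az resource show --resource-group $ResourceGroupName --name $ADFName `
    --resource-type "Microsoft.DataFactory/factories" --query "identity.principalId" --output tsv
az keyvault set-policy --name $KeyVaultName --object-id $PrincipalId `
    --secret-permissions get list --output $OutputFormat
```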
Continue reading…

Azure Data Factory & DevOps – Automated Deployment via Azure CLI

In the coming series of posts, I would like to shift the focus to subjects like automation and DevOps. This post will show how the entire Data Factory environment and surrounding services can be deployed from a command line or, to be more precise, by the Azure CLI.

Such an approach may look tedious and like overkill at first glance; however, it pays off since it brings reproducibility, enforcement of standards, and avoidance of human mistakes.

Our sample environments

Besides a plain data factory, it is not uncommon for data engineers to use some extra Azure services. Often the landscape also includes a storage account, a key vault, etc. Together, these pieces create an isolated environment for every stage: Development, Acceptance, and Production.

the roadmap

The picture above illustrates the importance of standardization and a naming convention. All three environments do not just look similar; they are created and named consistently. Most importantly, strict naming is the basis for further automation, scripting, and DevOps.
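To illustrate, provisioning a single stage following this convention boils down to a few commands. A sketch for the development environment (all names are derived from the convention above; az datafactory create assumes the datafactory CLI extension is installed):

```powershell
az group create --name "rg-adf-devops2020-dev" --location "westeurope"

# a data-lake-enabled storage account (hierarchical namespace = ADLS gen2)
az storage account create --name "adlsadfdevops2020dev" `
    --resource-group "rg-adf-devops2020-dev" `
    --kind StorageV2 --sku Standard_LRS --enable-hierarchical-namespace true

az keyvault create --name "kv-adf-devops2020-dev" --resource-group "rg-adf-devops2020-dev"

az datafactory create --name "adf-devops2020-dev" `
    --resource-group "rg-adf-devops2020-dev" --location "westeurope"
```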

Continue reading…