Category Archives: Azure Databricks
Scripting PySpark DataFrames
Developing Spark applications means dealing with Spark DataFrames. These objects are in-memory data structures accessible via various APIs but locally scoped. Sometimes we need to use them outside the runtime environment. Scripting a DataFrame as a collection of Python commands that fully reproduces the object is a possible, and occasionally preferred, solution. The use cases […]
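A minimal sketch of the idea, assuming the DataFrame is small enough to collect to the driver; `script_dataframe` and its arguments are illustrative names for this post, not a Databricks or PySpark API:

```python
def script_dataframe(rows, schema_ddl):
    """Render Python source code that recreates a small DataFrame.

    `rows` is a list of plain dicts (e.g. [r.asDict() for r in df.collect()])
    and `schema_ddl` is a DDL schema string such as "id INT, name STRING".
    The returned text can be pasted into any Spark session.
    """
    lines = ["data = ["]
    for row in rows:
        # repr() of plain Python values round-trips for simple types;
        # dates, decimals, and nested structs would need extra handling.
        lines.append(f"    {row!r},")
    lines.append("]")
    lines.append(f"df = spark.createDataFrame(data, {schema_ddl!r})")
    return "\n".join(lines)


script = script_dataframe(
    [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    "id INT, name STRING",
)
print(script)
```

Running the generated script in another Spark session rebuilds the same DataFrame without any access to the original environment.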
Delta Properties and Check Constraints at Scale
Delta is arguably the most popular data format in the current Data Engineering landscape. It is mainly used within the Databricks/Spark ecosystem; however, this is not a requirement. It is a universal open-source format and works great out of the box. However, its default behavior might not be sufficient for some use cases, and then […]
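Applying properties and constraints "at scale" usually comes down to generating the same ALTER TABLE statements for many tables. A small sketch under that assumption; the table, property, and constraint names are examples, but the SQL syntax is standard Delta Lake:

```python
def delta_tuning_statements(table, properties, constraints):
    """Build ALTER TABLE statements that apply Delta table properties
    and CHECK constraints to one table; run each via spark.sql()."""
    statements = []
    if properties:
        props = ", ".join(f"'{k}' = '{v}'" for k, v in properties.items())
        statements.append(f"ALTER TABLE {table} SET TBLPROPERTIES ({props})")
    for name, expr in constraints.items():
        statements.append(
            f"ALTER TABLE {table} ADD CONSTRAINT {name} CHECK ({expr})"
        )
    return statements


# Example: one table with one property and one check constraint.
statements = delta_tuning_statements(
    "main.sales.orders",
    {"delta.autoOptimize.optimizeWrite": "true"},
    {"valid_amount": "amount >= 0"},
)
for sql in statements:
    print(sql)
```

Looping this helper over a table inventory keeps the configuration declarative and repeatable instead of hand-written per table.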
Towards Databricks Certified Data Engineer Professional
I decided to obtain the Databricks Certified Data Engineer Professional certification within 2023. It was a New Year's resolution that resonated well with my career path. Surprisingly, few related study materials, blog posts, or forum discussions were available. Still, going in blind, I passed the exam on the first try. This story is a compilation […]
Databricks Observability: Processing Collected Ganglia Metrics
Introduction In this article, we will explore the use of the collected Ganglia metrics and their preparation for observability tools like Grafana. We will begin by examining how the metrics files are organized and what kind of content they hold. Then, we will discuss how the metrics are going to be processed by creating a notebook to […]
Databricks Observability: Collecting Cluster Ganglia Metrics
Introduction This blog post is about a few host metrics, like CPU and RAM usage, and their importance in understanding cluster utilisation. The simple task of tracking them can turn non-trivial if the number of clusters grows over time. The real challenge is to track such metrics for short-lived job clusters. This […]