Category Archives: Tools
Scripting PySpark DataFrames
Developing Spark applications means dealing with Spark DataFrames. These objects are in-memory data structures, accessible via various APIs but scoped to the local runtime. Sometimes we need to use them outside that environment. Scripting a DataFrame as a collection of Python commands that fully reproduces the object is a possible, and occasionally preferred, solution. The use cases […]
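A minimal sketch of that idea, assuming a small DataFrame and a helper name `script_dataframe` of my own invention (the post's actual implementation may differ): collect the rows, derive a DDL schema string, and emit a `spark.createDataFrame` call that rebuilds the object.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def script_dataframe(df, name="df"):
    """Render a small DataFrame as Python code that recreates it.

    Collects everything to the driver, so this only suits small,
    lookup-style DataFrames, not big data.
    """
    rows = [tuple(row) for row in df.collect()]
    # Build a DDL schema string, e.g. "id int, name string"
    ddl = ", ".join(f"{f.name} {f.dataType.simpleString()}" for f in df.schema.fields)
    return f"{name} = spark.createDataFrame({rows!r}, schema={ddl!r})"

df = spark.createDataFrame([(1, "a"), (2, "b")], "id INT, name STRING")
print(script_dataframe(df))
# df = spark.createDataFrame([(1, 'a'), (2, 'b')], schema='id int, name string')
```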
Delta Properties and Check Constraints at Scale
Delta is arguably the most popular data format in the current Data Engineering landscape. It is mostly used within the Databricks/Spark ecosystem, but that is optional: it is a universal open-source format and works great out of the box. However, the default behavior may not be sufficient for some use cases, and then […]
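For context, both features in the title are plain Delta Lake SQL; a sketch against a hypothetical `sales` table (the property values are illustrative, not recommendations):

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with Delta Lake support and an existing
# Delta table called `sales`.
spark = SparkSession.builder.getOrCreate()

# Table properties tune Delta's behavior, e.g. log and file retention.
spark.sql("""
    ALTER TABLE sales SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")

# Check constraints make Delta reject writes that violate a predicate.
spark.sql("ALTER TABLE sales ADD CONSTRAINT amount_positive CHECK (amount > 0)")
```

Applying these one table at a time is easy; the "at scale" part is doing it consistently across many tables.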
Delta-RS & DuckDB: Read and Write Delta Without Spark
I have used Apache Spark (often as Azure Databricks) for some years and see it as the sledgehammer of data processing. It is a reliable tool built on the JVM that does in-memory processing and can spin up multiple workers to distribute the workload across various use cases. It does not matter whether small or considerable […]
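The combination in the title looks roughly like this; a sketch with hypothetical local paths and column names, using the `deltalake` (delta-rs) and `duckdb` Python packages, with no Spark or JVM involved:

```python
import duckdb
from deltalake import DeltaTable, write_deltalake

# Read an existing Delta table into an Arrow table.
dt = DeltaTable("./data/events")
events = dt.to_pyarrow_table()

# DuckDB can query the in-memory Arrow table directly by variable name.
daily = duckdb.sql(
    "SELECT event_date, count(*) AS n FROM events GROUP BY event_date"
).arrow()

# delta-rs writes the result back as a new Delta table.
write_deltalake("./data/daily_counts", daily, mode="overwrite")
```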
Data Engineering Development Environment using WSL and Miniconda
Recently I managed to get the Windows Subsystem for Linux configured on my Windows 10 workstation. So, the logical next step was to put it to use by getting PySpark installed, configured, and running there. However, instead of running dozens of pip install commands and manually post-configuring all those packages, I decided to rely on the […]
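The excerpt is cut off, but given the title, the Miniconda route presumably centers on an environment file; a hypothetical minimal `environment.yml` (the package list and versions are my assumption, not the post's):

```yaml
# environment.yml -- illustrative package set for a PySpark sandbox
name: pyspark-sandbox
channels:
  - conda-forge
dependencies:
  - python=3.10
  - openjdk=11        # JVM required by Spark
  - pyspark=3.4
  - jupyterlab
```

Then `conda env create -f environment.yml` followed by `conda activate pyspark-sandbox` stands in for the pile of manual pip installs.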
Preparing Windows 10 for Data Engineering or WSL(2) ❤ Windows Terminal
Recently I had to rethink the way I use Ubuntu as my local sandbox environment. The common approach was to have a small VM running in VirtualBox, or sometimes even a dedicated VM in Azure/AWS. However, Windows Subsystem for Linux (WSL) comes up more and more often as a lightweight and easy-to-use alternative […]