Farm-to-Table: Rationalizing Data Preparation as a Part of Technology Strategy
Are your insights free-range, organic, and locally sourced?
I’ve always been more of a baker than a chef. To me, baking feels a lot like programming — a baker takes ingredients and combines them into complex structures through measurement, process, and experimentation to create something greater than the sum of its parts. Answering practical questions such as “How do I make these muffins more moist?” requires thinking through the baking process mechanically to understand how the amount of ingredients and their relationship to other ingredients contribute to the delicious baked good.1
I’m fortunate to live in an age where I can decide what I want to bake, add the ingredients to a grocery list, and pick them up ready-to-bake from Publix. I don’t need to go through a complicated process to refine sugar cane into granulated sugar just to get a cup of sugar. I don’t need to churn my own butter from heavy cream. I don’t need to go build a coop to keep chickens just to have eggs (nor do I even need to crack my own eggs if I am so inclined). They’re all things that I could do if I had a very specific reason for it, such as if I thought homemade butter tasted better, but for 99.9% of all baking I do, pre-processed store-bought ingredients taste great and make baking a practical hobby as opposed to a months-long endeavor.
Analysts are a lot like bakers. Their goal is to serve up delicious insights for their business stakeholders. This requires combining source data in specific ways until it becomes more than just the sum of its parts. However, unlike bakers, analysts don’t always have ready-to-bake ingredients available to them. If an analyst needs butter, they need to go and churn it themselves. If they need eggs, they need to go get eggs from the coop and make sure they aren’t bad. Maybe they can substitute honey or applesauce for sugar in a pinch so they don’t need to go harvest the sugar cane themselves, but it’s just not going to taste the same.
This is where data engineers and analytics engineers become valuable. Data engineers and analytics engineers are farmers, food processors, health inspectors, shipping networks, and grocery stores all in one. Data engineers and analytics engineers maintain the coops, raise the chickens, check the eggs for salmonella, place them into a carton, and get them to the shelves of the grocery store so all the analyst needs to do is decide if they want the cheap eggs or the fancy eggs with Omega-3s.
For data, this ingredient lifecycle, from sugar cane seed to baked cake, is called the Data Preparation process. It consists of three discrete activities:
Data Cleaning is the process of removing erroneous or problematic data through systematic means. This includes everything from correcting spelling errors to removing duplicate records or even sanitizing SQL injections.
This is handled through a partnership between data engineering teams and the organization’s wider technology team, combining automated scripts with troubleshooting at the source to prevent problems from recurring.
Within our baking metaphor, this is quality control disposing of ingredients that have gone bad, or farmers treating sick animals.
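To make the cleaning step concrete, here is a minimal sketch in Python. The records and field names are hypothetical, but the pattern is the kind of automated script a data engineering team might run: normalize messy fields and discard duplicate records before anything reaches the shelf.

```python
# Hypothetical raw records: one duplicate id, inconsistent casing/whitespace.
raw_records = [
    {"id": 1, "email": "Ada@Example.com "},
    {"id": 1, "email": "ada@example.com"},   # duplicate of id 1
    {"id": 2, "email": "grace@example.com"},
]

def clean(records):
    """Normalize fields and drop duplicate ids, keeping the first seen."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec["id"] in seen:
            continue  # duplicate record: discard it
        seen.add(rec["id"])
        # Normalize the email field: strip whitespace, lowercase.
        cleaned.append({"id": rec["id"], "email": rec["email"].strip().lower()})
    return cleaned

print(clean(raw_records))
# [{'id': 1, 'email': 'ada@example.com'}, {'id': 2, 'email': 'grace@example.com'}]
```

In practice this logic lives in automated pipelines rather than one-off scripts, which is exactly why it belongs to engineering rather than to the analyst reaching for an egg.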
Data Tidying is the organization of data into easily understandable and readily accessible tables.
This is handled by either data engineering teams or analytics engineering teams, depending on how an organization segments its data team.
Tidy tables contained in an organization’s data warehouse or data lake (often called an Analytics Layer) are considered the canonical data for an organization. Each table serves as a technical capability that provides the organization reliable, modular access to insights about that table’s subject and attributes. The tables are also a starting point for data analysts, raw material for data products, or points of integration for other parts of an organization’s Enterprise Architecture via reverse ETL.
Supporting these tables’ data quality is the primary scope of the data governance function.
Within our baking metaphor, this is everything required to process, package, and ship raw ingredients to stores.
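A common tidying move is reshaping "wide" source data (one column per month, say) into one row per observation, so each table in the analytics layer has a clear grain. A minimal sketch, using hypothetical store sales data:

```python
# Hypothetical wide-format source data: one column per month.
wide_rows = [
    {"store": "North", "jan": 100, "feb": 120},
    {"store": "South", "jan": 80,  "feb": 95},
]

def tidy(rows, id_col, value_cols):
    """Melt wide rows into tidy records: one row per (id, month) observation."""
    out = []
    for row in rows:
        for col in value_cols:
            out.append({id_col: row[id_col], "month": col, "sales": row[col]})
    return out

for rec in tidy(wide_rows, "store", ["jan", "feb"]):
    print(rec)
```

Once the data lands in this shape, the analyst can filter, join, and aggregate it without first untangling the source system's layout, which is the whole point of the grocery store.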
Data Shaping is the process of sorting, joining, aggregating, filtering, and pivoting data to create a model suitable for a specific analysis.
This is the primary task of data analysts looking for insights.
This is the actual process of baking the cake!
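Here is a minimal sketch of shaping in action, using Python's built-in sqlite3 module as a stand-in for whatever query engine an analyst actually has. The tables and values are hypothetical, but the query does the shaping work the definition lists: joining, filtering, aggregating, and sorting tidy tables into a model for one specific analysis.

```python
import sqlite3

# Stand up two hypothetical tidy tables from the analytics layer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER, region TEXT);
    CREATE TABLE orders (store_id INTEGER, amount REAL);
    INSERT INTO stores VALUES (1, 'North'), (2, 'South');
    INSERT INTO orders VALUES (1, 50.0), (1, 70.0), (2, 30.0), (2, 10.0);
""")

# Shape: join orders to stores, filter out small orders,
# aggregate by region, and sort for presentation.
rows = conn.execute("""
    SELECT s.region, SUM(o.amount) AS total
    FROM orders o
    JOIN stores s ON s.store_id = o.store_id
    WHERE o.amount >= 20
    GROUP BY s.region
    ORDER BY total DESC
""").fetchall()

print(rows)  # [('North', 120.0), ('South', 30.0)]
```

Notice that none of this is cleaning or tidying: the analyst starts from trustworthy ingredients and spends the whole query in the kitchen.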
The more mature an organization’s data and technology teams, the more that these functions are clearly defined and ownership of each stage is given to different teams. The less mature an organization is, the more the “data shaping” activity also requires cleaning and tidying. Thus your bakers start to pull double duty as farmers, health inspectors, and logistics coordinators.
Part of the task in front of data leaders is creating and implementing a technology strategy that gets easy-to-use data in the hands of analysts so they are not repeating the same transformations over and over again. In an ideal world, analysts’ time is spent only on transforming data relevant to the problem to get insights for the business. In baking terms, this means getting ingredients in front of analysts so they’re spending all their time in the kitchen and none on the farm.
This is a tricky balance to strike because to very engineering-centric data leaders, everything can seem like an automation opportunity. However, this approach can lead to your data engineering team churning out Twinkies on a factory line as opposed to baking hand-made cakes and brownies to taste. Leaders need to work hand-in-hand with engineers and analysts to understand recurring patterns in analysis and prioritize the robust capabilities that need to be built to support analysts day-to-day. Think of this as looking at the last few dozen recipes that you’ve baked to determine which ingredients you need and what forms they work best in.
To help find this balance, my next two posts will be about the Data Tidying and Data Shaping process. Design patterns exist that can be applied to architecting individual data grains within your analytics layer to maximize end-user analyst flexibility, much like giving a baker packaged ingredients. Likewise, techniques exist for analysts to efficiently model tidy data into insights so that more than break-and-bake cookies can be on the dessert menu.
The answer, by the way, is doubling the Greek yogurt the recipe calls for.