Software Engineering is Coming for More Than Data Tools/Practices: It's Coming for Identity

+ Incentives

Jul 10, 2023

It’s been a while since my first post on naming the problem plaguing data people every day. Context switching and mental load are the norm, they cause fatigue, and it feels like playing defense for the majority of the craft. That’s still true today, but I feel particularly charmed and hopeful about the new momentum I’m seeing from players evolving the craft with their own attack vectors. It makes me think:

Shipping data is more valuable than ever

Note: this list is from a shower thought session where I connected the dots between news that feels random on its own and saw the thread weaving it all together

Dagster is picking up the pace with their latest round of funding and doubling down on their software-defined asset pattern for data pipelines
I saw someone post about dpm (think: using data as easily as a `pip install`) and instantly whiplashed to an adjacent topic I wrote about here:
The Analytics Engineering Roundup
dbt is learning to ❤️ software engineers
In the most recent podcast episode, Jordan Tigani of MotherDuck and DuckDB flips the vision for "data apps" on its head. Tristan, Julia and Jordan dive into the origin story of BigQuery, why Jordan thinks we should do away with the concept of working in files, and how truly performant “data apps” will require bringing data to an end user’s machine (rath…
Anna Filippova and Sung Won Chung
Rust is an emerging player in data work, especially with the release of pydantic v2 (think: Python data validation) and polars (think: pandas alternative)
MotherDuck launched their product to bring hybrid compute: DuckDB running on your laptop and in their cloud (works best with a data lake). Works with Swift 👀
Apparently, you can deploy data APIs using SQL + Jinja: VulcanSQL
Proprietary data is in the business zeitgeist because the generative AI scene needs a good shovel
- Ex: Reddit monetized their API
Accountants see the value in clean, structured data in working with ChatGPT
- “You could then use Code Interpreter to do financial analysis on top of all this—for things like discounted cash flow—and now you have an automated finance department. It won’t do all the work, but it’ll get you 90% of the way there. And the 10% remaining labor looks a lot more like the job of a data engineer than a financial analyst.”
An army of laid-off, highly skilled, highly paid software engineers is on the job market and not settling for less than they’re worth
- All dollars point to: data + AI

This tidal wave of stories is a forcing function because we’re all running to sell data and pump it into the next great LLM. There’s something electrifyingly existential about the data industry’s shift that requires more than virtue signaling to “best practices” (read: because there are none yet) to accomplish this. It isn’t enough to bring the theatrics of software engineers to data engineers. It forces us to evolve who we are. Or more plainly said:

Because data pipelines are a business:

Data engineers are becoming software engineers.

Our Stories Converge More than We Think

Don’t worry, your skepticism meets my benchmark too. We all know the historical chasm between software and data engineers, and how little real empathy and shared experience there is between them. Heck, I know some software engineers who have never written analytical SQL in their life. Heck, I’m someone who’s never written a line of JavaScript in my life. Heck! I just happened upon this post from Chad Sanderson emphasizing the workflow divide and potentially refuting the whole premise of this post here! And he makes great points, as shipping data for internal stakeholders is a very different game from selling software to customers.

And there’s a reason for this divide: neither side has ever had to care in the same way. We haven’t tasted wins together. We don’t taste the same flavors of pain…until now.

Everyone has to care about the data and software crafts merging together, whether they like it or not, or get left behind. That requires an upgrade to what we think of as ownership (and what enables it).

Nina’s story feels a lot like when a software engineer is on call. You just need to replace a couple of things:

Your website is broken when you open it one day
You furiously check Datadog and database logs
Someone changed a couple variable names in prod when they shouldn’t have
Integrations aren’t working with the right variables
You validate whether this affects all users or just some
You update a status page with the specific problem and say, “we’re working on it”
You revert the latest changes, test it locally, and redeploy to prod and eyeball Datadog logs and the public website until your eyes pop out with relief: mission complete
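On the data side, the "someone changed a couple variable names" step often shows up as a renamed column. Here is a minimal sketch, using only the Python standard library, of a check that could catch that before a deploy. The column names, the `EXPECTED_COLUMNS` contract, and the `check_contract` helper are all hypothetical, made up for illustration rather than taken from any particular tool.

```python
# Hypothetical contract: column name -> expected Python type.
EXPECTED_COLUMNS = {"user_id": int, "email": str, "signup_date": str}


def check_contract(rows, expected=EXPECTED_COLUMNS):
    """Return a list of readable problems; an empty list means the batch passes."""
    problems = []
    for i, row in enumerate(rows):
        # Columns the contract promises but the row lacks (e.g. after a rename).
        missing = sorted(expected.keys() - row.keys())
        # Columns the row carries that the contract does not know about.
        unexpected = sorted(row.keys() - expected.keys())
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
        if unexpected:
            problems.append(f"row {i}: unexpected columns {unexpected}")
        # Type check only the columns that are actually present.
        for name, expected_type in expected.items():
            if name in row and not isinstance(row[name], expected_type):
                problems.append(
                    f"row {i}: {name} should be {expected_type.__name__}, "
                    f"got {type(row[name]).__name__}"
                )
    return problems


good_batch = [{"user_id": 1, "email": "a@example.com", "signup_date": "2023-07-01"}]
# Same data, but someone renamed user_id to userId upstream.
renamed_batch = [{"userId": 1, "email": "a@example.com", "signup_date": "2023-07-01"}]

good_problems = check_contract(good_batch)
renamed_problems = check_contract(renamed_batch)

for problem in renamed_problems:
    print(problem)
```

Run in CI against a sample of the new output, a check like this turns the furious log-digging step into a failed build with a readable message.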

And why does emphasizing this pain matter?! Because it’s building empathy by default. It’s the glimmers of how great software is born:

It shouldn’t hurt this much for me. It shouldn’t hurt this much for you. It shouldn’t hurt this much for us. This shouldn’t be part of what we do and who we are.
Let’s fix it.

When data pipelines are the business, we need tools and practices ~~good~~ amazing enough so we’re not swimming through data swamps to get our jobs done. And because of these roaring incentives, industry momentum, and personal stories, data engineers are embodying what software engineers feel all the time: extreme ownership. Because when data is down like software is down, you lose customers.

Let’s Get Specific

With that ownership comes addressing the hard things head-on. There are lots of painful things we don’t have elegant answers for. When we’re selling data as product, we can’t run ragged each time to ship a data pipeline. Like, imagine a chef presenting you their dish while mouth-breathing, with the staff sweating all over, broken dishes everywhere, and smoke billowing from the kitchen. It’s not a good look.

I want data engineer identities to evolve so much that the below becomes ridiculously easy within 2 years:

Fast is measured in seconds, not minutes
Data validation
Data diffing
Data observability
Data sharing
CI/CD
Schema evolution (a man can dream)
Backfills (a man only has nightmares)
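To make one item on this list concrete, here is a rough sketch of what data diffing means at its smallest: compare two versions of a table by primary key and report what was added, removed, or changed. It uses only the Python standard library. The `diff_rows` helper, the `DiffResult` class, and the sample `prod` and `dev` rows are hypothetical names for illustration; real tools do this inside the warehouse at far larger scale.

```python
from dataclasses import dataclass, field


@dataclass
class DiffResult:
    added: list = field(default_factory=list)    # keys only in the new table
    removed: list = field(default_factory=list)  # keys only in the old table
    changed: dict = field(default_factory=dict)  # key -> {column: (old, new)}


def diff_rows(old_rows, new_rows, key):
    """Compare two lists of dict rows using `key` as the primary-key column."""
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}

    result = DiffResult()
    result.added = sorted(new_by_key.keys() - old_by_key.keys())
    result.removed = sorted(old_by_key.keys() - new_by_key.keys())

    # For keys present on both sides, record every column whose value differs.
    for k in sorted(old_by_key.keys() & new_by_key.keys()):
        old, new = old_by_key[k], new_by_key[k]
        columns = sorted(old.keys() | new.keys())
        delta = {
            c: (old.get(c), new.get(c))
            for c in columns
            if old.get(c) != new.get(c)
        }
        if delta:
            result.changed[k] = delta
    return result


# Hypothetical sample data: what is in prod today vs. what a dev branch produces.
prod = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 20.0},
    {"id": 3, "amount": 30.0},
]
dev = [
    {"id": 2, "amount": 20.0},
    {"id": 3, "amount": 35.0},
    {"id": 4, "amount": 40.0},
]

report = diff_rows(prod, dev, key="id")
print("added:", report.added)
print("removed:", report.removed)
print("changed:", report.changed)
```

The point of the sketch is the shape of the answer: a reviewer sees exactly which rows a change touches before it ships, instead of finding out from a customer.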

And you know what? I want to catalyze this momentum. Complaining isn’t satisfactory (see me do more than complaining below). I want the above to feel as silly as saying Microsoft Office is hard.

flow state tool: Demo 🚀 - Watch Video

In the short term, my story starts with something simple. Build and role model data tools so absurdly useful and elegant that data engineers go, “Ooo this guy is having more fun and is in a lot less developer pain. I want some of that fun!” Like, back in 2005 when my friends and I first played God of War and we passed the controller each time one of us died because it was so gosh darn fun.

And, I want to build it with software AND data engineers. Software engineers are going to look at our craft and have enough lessons learned and battle scars that they’ll inadvertently evolve developer ergonomics (read: isn’t every new JavaScript framework about developer experience?). Data engineers will look at software engineers’ data models and inadvertently optimize them for both application and analytics. Because remember, we’re selling data now, and it should feel good while doing it.

Ultimately…

I want to live in a world where data pipelines are less defense and more offense
I want to live in a world where really 🧃 juicy data as product flows everywhere
I want to live in a world where there’s a lot less unnecessary pain

Because then something magical happens:

We surpass who we thought we were supposed to be!

Thank you for reading! Share this if you feel the same. Heck! Share it if you disagree. What matters is that you care :)

Sung’s Substack
