What exactly do data engineers do? Are they junior software engineers who only focus on analytics, or glorified data peddlers? These are probably familiar – and frustrating – questions that those working in the field get from family and friends. And that is not even counting those who don’t understand but never got around to asking.
For Samantha Zeitlin, a lead data scientist who also holds a PhD in biochemistry and cell biology, explaining her job to a colleague turned into a full-length blog post that should serve as a handy bookmark for the above-mentioned group of people.
We sum up three key takeaways below.
Of data scientists and data engineers
First, what is the difference between a data engineer and a data scientist? The former is focused on the acquisition of data, including its requisition, cleaning, and validation. Data scientists, on the other hand, are more concerned with analyzing the data, which might call for building data models and using machine learning.
While there are some inevitable overlaps, a data engineer will typically oversee more of the data pipelines while the data scientist will own more of the data analyses and data models.
“Data engineering is getting data, cleaning data, reshaping data, validating data, and loading it into databases. Data science is all of that, plus analyzing the data and figuring out how to display it in a way that makes sense, and sometimes also building models and doing machine learning,” explained Zeitlin.
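To make the "getting, cleaning, reshaping, validating, and loading" steps concrete, here is a minimal sketch in Python. It is not taken from Zeitlin's post: the sample data, the `users` table, and the function names are all hypothetical, and a real pipeline would likely rely on dedicated ETL tooling rather than the standard library.

```python
import csv
import io
import sqlite3
from datetime import date

# Hypothetical raw input; in practice this might come from a file or an API.
RAW_CSV = """user_id,signup_date,country
1,2021-03-04,sg
2,2021-03-05, us
,2021-03-06,my
3,not-a-date,id
"""


def extract(text):
    """Get the data: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean and reshape: trim whitespace, upper-case countries, cast IDs."""
    cleaned = []
    for row in rows:
        user_id = row["user_id"].strip()
        cleaned.append({
            "user_id": int(user_id) if user_id.isdigit() else None,
            "signup_date": row["signup_date"].strip(),
            "country": row["country"].strip().upper(),
        })
    return cleaned


def validate(rows):
    """Validate: keep only rows with an ID and an ISO-format date."""
    valid = []
    for row in rows:
        if row["user_id"] is None:
            continue
        try:
            date.fromisoformat(row["signup_date"])
        except ValueError:
            continue
        valid.append(row)
    return valid


def load(rows, conn):
    """Load: write the validated rows into a (hypothetical) users table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER PRIMARY KEY, signup_date TEXT, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO users VALUES (:user_id, :signup_date, :country)", rows
    )
    conn.commit()


def run_pipeline(text, conn):
    """Run all steps in order and return the number of rows loaded."""
    rows = validate(clean(extract(text)))
    load(rows, conn)
    return len(rows)
```

Calling `run_pipeline(RAW_CSV, sqlite3.connect(":memory:"))` loads only the two well-formed rows; the row with a missing ID and the row with a malformed date are dropped at the validation step.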
Data engineers are software engineers, too
That data engineers are somehow not real engineers is a common misconception. This can probably be attributed to the fact that no usable product gets produced; results are often invisible and used to inform executive decisions or populate sections of the website or mobile app.
Yet the invisible product is as real as it gets. Multiple stakeholders within the organization benefit from what data engineers (and data scientists) produce. Beneficiaries might include business analysts, marketing, sales, product managers, and finance – among others.
What’s more, data engineers are technically capable. The ability to program to manipulate data is a given. Moreover, they usually deploy their own technology stack to help them with their disparate data management responsibilities. Additional effort must also be made to architect and deploy systems to manage data and process it.
Data engineering is hard
If you think about it, the skillset for a typical data engineer might be even more demanding than a software engineer’s in terms of the breadth and depth of knowledge. It spans writing and running tests, using debugging tools, setting up containers to process data, and understanding and managing multiple database systems, and the list goes on.
“[We] have to know about, and typically have to support, a lot of different databases… We do a lot of ETL [Extract, transform, load]. We have special tools that we use to automate ETL… We have to know all about streaming and cloud stuff [and] understand distributed systems… We also use containers,” Zeitlin said.
It is worth noting that some of these tools, such as containers, are under rapid development. This means learning them can be a “moving target”, and the humble data engineer must be quick on the uptake to thrive.
The next time someone asks about your job, just point them to this post.
You can read Zeitlin’s full blog post here.
Image credit: iStockphoto/imtmphoto