Data engineers or data scientists who work with SQL should treat it like source code, asserts a data professional going by the pseudonym of “DataExpert”.
In a blog published on Dev Genius, a publication of Medium, the author argued that managing SQL as code should be incorporated into the data strategy of all organizations – including the smallest organizations.
As one might expect, the post generated a fair amount of discussion from data and business analysts. But first, why should data professionals and citizen data scientists care about SQL?
SQL is here to stay
SQL or Structured Query Language was designed to access and manage data in relational database management systems (DBMS). First developed in 1970, SQL today is generally used with programming languages such as Python to retrieve data from corporate database systems for further analysis.
Indeed, SQL is so entrenched within data repositories that it is practically impossible to avoid even in cloud-based database systems or next-gen data platforms.
For instance, when Bob Muglia, former CEO of Snowflake recently spoke about the state of data platforms and how analytic systems will merge with data platforms, he conceded that SQL will continue to play a crucial role for now.
“I think you’ll see analytic systems merging into the data platforms… You’ll see a very complete stack that will have both analytics and advanced analytics and machine learning systems, together with SQL-based data management systems,” Muglia was quoted as saying.
In a nutshell, knowledge of SQL is vital for those working in data and is useful for analysts or managers with a need to query databases.
If it looks and works like programming code
Like object-oriented code, SQL is time-intensive to write, debug, and maintain, writes DataExpert, and should hence be treated as programming code.
This calls for a seamless system to deploy new SQL code or make changes to the development environment. Only then can failures be identified and fixed instantly – or in the event bugs are inadvertently introduced, to revert the production to the last working state.
To achieve this, organizations must drop their laissez-faire treatment of SQL for a more disciplined approach. Existing scripts should be centralized to a global location and changes to SQL statements meticulously tracked. Finally, changes to SQL must be tested and approved by another engineer before code changes are committed to the repository.
How should organizations begin? “Pick a code repository and stick with it. The code repository should ideally be shared with engineering, but at the very least, centralize all SQL code in a repository. This should include all data functions such as data engineering, analytics, [and] business intelligence,” he wrote.
The author also recommended that organizations implement a proper development environment, use version control to catch erroneous changes to the code base and make the codebase more accessible to relevant employees across the organization.
Getting ahead of the curve
In response to the recommendations, a business intelligence developer agreed on Reddit that SQL can devolve into a nightmare to maintain without a proper system to track code changes.
Sharing from his experience at a healthcare organization, the developer by the pseudonym of “Touvejs” noted that some SQL scripts can run into thousands of lines. They might also be several years old and have been modified dozens of times to fix freshly-discovered bugs or to incorporate new features.
With no version control feature, things get messy quickly: “[A manual approach is] an awful system. Non-revertible changes. Previous versions lost. No insight into exactly what lines changed,” writes Touvejs.
What is clear is that there isn’t a single tool or technique to improve the situation, but that acknowledging the issue and centralizing SQL code is the way to start. Another commentator on Reddit suggests that an aggressive approach works best.
“[My deployments drop] everything in the target database that's not under source control… I often [annoy] developers when I join a project because I keep deleting their stuff off the [development] servers. But eventually, they learn to check in their SQL, or I take away their admin rights to the databases,” he wrote.
There are tools out there to help organizations work with data and SQL, explained another commentator, and businesses will eventually incorporate areas such as data engineering, data warehousing, and machine learning into the data pipeline to ease the adding of new features.
For those still mulling how to manage their SQL as code, another commentator had this encouraging point to add: “[Consensus] seems to be that very few companies treat SQL like objected-oriented code. If your company does; you’re ahead of the curve!”
Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].
Image credit: iStockphoto/die-phalanx