AI-Powered GitHub Copilot May Be What Data Scientists Need

Microsoft’s GitHub last week launched the technical preview of a new AI-powered tool it says is designed for “pair programming” and to help software developers (or budding data scientists) write better code with less effort.

For the uninitiated, pair programming is a software development technique where two human programmers work together at a workstation to write code and critique each other, frequently switching roles.

AI-powered coding

GitHub Copilot draws context from the code as it is written, suggesting new lines or entire blocks of code. GitHub claims the tool will help programmers discover alternate ways to solve problems, write tests, and explore unfamiliar APIs without the painstaking process of rewriting code fragments obtained from the Internet.

Powered by technology from OpenAI, a company that received a USD 1 billion investment from Microsoft in 2019, GitHub Copilot is trained on “billions of lines” of code hosted in public repositories on GitHub and elsewhere.

As the underlying AI engine is trained on both source code and natural language, it interprets both written code and comments when making suggestions. In one example shown on the project website, pseudocode written in English was enough for GitHub Copilot to generate an entire code block within seconds.
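To make the comment-driven workflow concrete, here is a hand-written Python sketch. It is not actual GitHub Copilot output: the leading comment stands in for the English pseudocode a developer might type, and the function body is the kind of completion such a tool might propose. The function name and its behavior are hypothetical, chosen only for illustration.

```python
# Compute the average of the numeric values in a list,
# skipping entries that are None; return None if nothing is left.
def mean_ignoring_missing(values):
    # Keep only the entries that actually hold a value.
    present = [v for v in values if v is not None]
    if not present:
        return None
    return sum(present) / len(present)
```

Whatever a tool suggests from a comment like this still needs the same review as any other code, for instance checking how it handles an empty list.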

GitHub Copilot will also adapt over time to individual styles and preferences to complete work faster. GitHub says suggested code is uniquely generated, though its FAQ notes that snippets copied verbatim from the training set may appear “about 0.1% of the time”.

The announcement prompted a predictable storm of debate among developers over whether AI could replace them. As reported by Visual Studio Magazine, a GitHub Copilot post on Hacker News drew 1,262 comments, while a Reddit post on the same subject drew 575 comments at the time of writing.

“I've been using the [preview] for the past 2 weeks, and I'm blown away. Copilot guesses the exact code I want to write about one in ten times, and the rest of the time it suggests something rather good, or completely off. But when it guesses right, it feels like it's reading my mind,” wrote one commentator on Hacker News.

Why it matters to data scientists

For data scientists writing user-defined functions or code to clean or manipulate data, a tool such as GitHub Copilot can be a significant time-saver, freeing them to focus on the analyses that matter.
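As an illustration of the kind of small data-cleaning helper meant here, the following Python sketch normalizes a single record. It is hand-written for this article, not generated by GitHub Copilot, and the function name, the record layout, and the cleaning rules are all assumptions.

```python
def clean_record(record):
    """Trim whitespace, lowercase the keys, and turn empty strings into None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value == "":
                # Treat blank fields as missing values.
                value = None
        cleaned[key.strip().lower()] = value
    return cleaned
```

Routine helpers of this sort follow patterns that are common in public code, which is why they are a plausible fit for an AI assistant's suggestions.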

Moreover, GitHub Copilot can help data scientists quickly work through a new or unfamiliar language without spending too much time reviewing documentation or asking for help. Even experienced data scientists can code more confidently or identify errors more quickly.

And though GitHub Copilot does not work well with new APIs for which examples and code samples are scarce, this is typically not a problem for data scientists, who tend to rely on time-tested APIs for mundane tasks such as extracting data from SQL databases or producing data visualizations on popular data platforms.
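A routine SQL extraction of the sort described above might look like the following sketch, which uses Python's standard sqlite3 module. The sales table, its columns, and the in-memory demo database are assumptions invented for this example, and the code is hand-written rather than produced by GitHub Copilot.

```python
import sqlite3


def build_demo_db():
    """Create a throwaway in-memory database with a hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10.0), ("west", 5.0), ("east", 2.5)],
    )
    return conn


def fetch_sales_by_region(conn):
    """Return (region, total_amount) pairs, sorted by region name."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return cur.fetchall()
```

Calling fetch_sales_by_region(build_demo_db()) returns one row per region with its summed amount. Because database APIs like this one are long established and widely used in public code, they are the kind of ground on which suggestion tools have plenty of examples to draw from.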

GitHub Copilot works with a broad set of frameworks and languages, though GitHub says the technical preview “works especially well” with languages such as Python, JavaScript, TypeScript, Ruby and Go.

You can sign up for GitHub Copilot on its project website (there is currently a waitlist).

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: iStockphoto/Alfribeiro