Why Data Scientists Are Flocking to Python

Python is known as a highly accessible programming language that is widely considered to be an essential language for data scientists.

Surveys have indicated that Python is the top choice for data professionals, ahead of SQL and R, which are themselves substantially ahead of traditional programming languages such as Java and C.

Python for data science

First released in 1991 as a general-purpose, interpreted language, Python did not become popular with data professionals overnight. It evolved into a must-know language because it makes it easy to manipulate data and feed it into advanced analysis tools or AI models.

Indeed, Seth Dobrin, the vice president of IBM's Data and AI unit and chief data officer of IBM Cloud and cognitive software, noted that the ability to code using Python is the common thread for all roles on the data science team today.

Interviewing for a role on Dobrin’s data science team entails passing a coding challenge that candidates complete on their own, followed by a monitored coding session with a senior member of the team.

Python's relevance has led to a proliferation of Python courses for data professionals. For instance, the National University of Singapore offers a Python for Data Course for learners looking to use Python as a data science tool for programming and business analysis.

But while Python's status as a star programming language for data science is indisputable, what are its strengths, and how can organizations leverage it?

Strengths of Python

Python was designed to be easy to read and write, and its top draw is probably its simplicity. The syntax supports different coding styles, which can make developers more productive than they would be in statically typed languages like Java, or languages with a steep learning curve such as C++.
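To make the point about coding styles concrete, here is a minimal sketch that computes the same average in three ways: procedural, functional, and object-oriented. The function names, the Series class, and the sample readings are hypothetical and exist only for illustration.

```python
from functools import reduce

# Illustrative sample data (an assumption, not taken from any real dataset).
readings = [3.5, 4.0, 2.5, 5.0]


# Procedural style: an explicit loop with a running total.
def mean_procedural(values):
    total = 0.0
    for value in values:
        total += value
    return total / len(values)


# Functional style: fold the sequence into a sum with reduce.
def mean_functional(values):
    return reduce(lambda acc, value: acc + value, values, 0.0) / len(values)


# Object-oriented style: wrap the data and expose the calculation as a method.
class Series:
    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        return sum(self.values) / len(self.values)
```

All three return the same result for the same input, so a team can choose whichever style suits the task and its own conventions.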

One appeal of Python to data scientists is the many libraries it can easily access. These include libraries for data manipulation, mathematical and scientific computations, and visualizations, among others.

Moreover, many AI libraries for deep neural networks, machine learning, and data mining applications can also be accessed using Python. Facebook, which runs trillions of inference operations a day, relies on AI models built with PyTorch.

Developed at the social networking giant for applications such as computer vision and natural language processing, PyTorch sports what Facebook engineering director Lin Qiao calls a “first-class” Python integration.

Finally, Python has excellent built-in processing abilities that span traditional and unstructured data. Of course, memory mapping is probably unavoidable for larger datasets in the tens or hundreds of gigabytes. But with the correct libraries, even that should be easier with Python than with any other language.
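As a rough illustration of memory mapping, the sketch below uses only the standard library's mmap and struct modules to read a single record from a binary file without loading the whole file into memory. The file layout (one 64-bit float per record) and the helper names write_sample and read_record are assumptions made for this example; scientific libraries offer higher-level equivalents.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout: each record is one little-endian 64-bit float.
RECORD = struct.Struct("<d")


def write_sample(path, count):
    """Write `count` records (0.0, 1.0, 2.0, ...) to a binary file."""
    with open(path, "wb") as handle:
        for i in range(count):
            handle.write(RECORD.pack(float(i)))


def read_record(path, index):
    """Read one record by position; the OS pages in only what is touched."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return RECORD.unpack_from(mapped, index * RECORD.size)[0]
```

Calling write_sample once and then read_record with any index fetches that value directly by its byte offset. The same pattern applies to files far larger than available RAM, because only the pages actually accessed are read from disk.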

As a bonus, the fact that Python is compatible with all major platforms means that data scientists (or students) can run it on practically any computing system, including the new ARM-based MacBook.

Practical uses of Python

What are some practical uses of Python? A recent article on Analytics Insight outlined some ways that Python can be utilized.

  • Data gathering: Identifying the right datasets to use in a model is an essential but time-consuming task. With the ability to quickly filter and pick up pertinent data, mundane data gathering tasks can be automated using Python.
  • Data cleaning: Cleaning “dirty” data is considered one of the most expensive and tedious tasks in data science. Yet dirty data can culminate in lost productivity, wasted resources, or erroneous conclusions. Python scripts can be written to quickly identify errors and correct minor issues such as data formats.
  • Data exploration: Python can facilitate deeper data exploration to identify patterns and draw inferences from data. With the ability to quickly manipulate data points and their relationships, Python can facilitate the discovery of new insights to improve the bottom line.
  • Data visualization: With hundreds of available data visualization libraries, data professionals can use Python to produce visual representations that reveal trends and help them understand any dataset.
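
The data cleaning step in the list above might look something like the following sketch, which uses only the standard library to normalize mixed date formats and set aside rows it cannot repair. The column names, the accepted date formats, and the sample data are all assumptions made for illustration; real projects often use a dedicated data library for this.

```python
import csv
import io
from datetime import datetime

# Date formats this sketch accepts (an assumption for illustration).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

# Hypothetical "dirty" input with inconsistent dates and a bad amount.
SAMPLE = """date,amount
2021-03-01,10.5
02/03/2021, 7
20210303,n/a
not a date,3
"""


def normalize_date(text):
    """Return the date in ISO format, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_rows(raw_csv):
    """Split rows into cleaned records and rows that need manual review."""
    cleaned, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        date = normalize_date(row["date"])
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            amount = None
        if date is None or amount is None:
            rejected.append(row)
        else:
            cleaned.append({"date": date, "amount": amount})
    return cleaned, rejected
```

Running clean_rows(SAMPLE) keeps the two rows that can be repaired and sets the other two aside, so minor format issues are fixed automatically while genuine errors are flagged for a person to check.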

Ultimately, the popularity of Python lends itself to a virtuous cycle of success. As more data scientists use Python, existing code repositories, tools, and ecosystems around Python will grow, giving newcomers an even greater incentive to learn Python and use it for their data science initiatives.

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: iStockphoto/Alfribeiro