Defensive Python Coding for Data Scientists

By Paul Mah
October 13, 2021

Should modern developers be trained in secure coding techniques? This topic came up at a recent panel discussion hosted by CDOTrends when a participant asked if they should even hire developers who cannot write secure code.

As news of cybersecurity breaches and hacking incidents continues to make headlines globally, there is no question that a programmer would do well to pick up secure coding practices.

But should data scientists care? After all, many dabble in Python to manage and manipulate data as part of their day-to-day work. And though the code is typically not public-facing, there is no guarantee that it will stay within the corporate network. Moreover, writing secure code, like brushing your teeth or defensive driving, is a valuable lifelong skill. So why not?

Tips for writing secure Python code

In a blog post published last month, cloud-native application security provider Snyk outlined various security best practices for Python as part of its updated 2021 Cheat Sheet. What are some best practices that data scientists should be aware of? I highlight four of them below and explain why.

Sanitize external data

According to Snyk’s Frank Fischer, one vector of attack for any application is external data. While this is less likely to be a problem for data scientists than for a developer’s Python app running on the company website, it is entirely plausible that an injection attack can happen through poisoned data as part of a watering-hole or spear-phishing attempt.

The best defense against this is to sanitize data from external sources thoroughly and ensure inputs conform to expected data structures. Bleach is a popular HTML sanitizing library for content scraped from a website, while major frameworks such as Flask and Django come with their own escaping functions in the form of flask.escape() and django.utils.html.escape() (recent Flask releases expect markupsafe.escape() to be used instead). Use them.
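As a rough illustration of both halves of that advice, the sketch below checks that a record matches an expected structure and then escapes its string values. It uses the standard library's html.escape() as a stand-in for the framework helpers named above, and the EXPECTED_FIELDS schema and clean_record() helper are hypothetical names invented for this example.

```python
import html

# Hypothetical schema: each incoming record must have exactly these fields and types.
EXPECTED_FIELDS = {"name": str, "score": float}


def clean_record(raw: dict) -> dict:
    """Validate a record against EXPECTED_FIELDS and escape its string values."""
    if set(raw) != set(EXPECTED_FIELDS):
        raise ValueError(f"unexpected fields: {sorted(raw)}")
    cleaned = {}
    for field, expected_type in EXPECTED_FIELDS.items():
        value = raw[field]
        if not isinstance(value, expected_type):
            raise ValueError(f"{field!r} should be {expected_type.__name__}")
        # Escape HTML metacharacters so the text is inert if it is rendered later.
        cleaned[field] = html.escape(value) if isinstance(value, str) else value
    return cleaned


record = clean_record({"name": "<script>alert(1)</script>", "score": 0.9})
print(record["name"])
```

Rejecting records that do not match the schema, rather than trying to repair them, keeps the failure visible at the point where the data enters the pipeline.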

Be careful with downloaded packages

Data scientists learning Python for the first time will probably remember the amazing experience of typing in a single command to automatically download packages that offer a plethora of new capabilities. Developers typically use the standard package installer for Python (pip), explained Fischer, which pulls from the Python Package Index (PyPI). Long story short, malicious packages can and do appear on PyPI, especially under names that are common misspellings of popular packages. So, be sure to spell out the package name carefully.

One alternative for data scientists to sidestep this issue is to go with something like Anaconda, which comes with many of the top Python packages already bundled. It is free for individual use, or your organization might already have a license for the commercial edition.
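For teams that stay with pip, one way to catch a misspelled name before it is installed is to compare it against a list of packages that have already been vetted. The sketch below does this with the standard library's difflib; the APPROVED list, the 0.8 similarity cutoff, and the check_package_name() helper are all assumptions made for illustration, not part of pip or of Snyk's cheat sheet.

```python
import difflib

# Hypothetical allowlist of packages the team has already vetted.
APPROVED = ["numpy", "pandas", "requests", "scikit-learn", "matplotlib"]


def check_package_name(name: str) -> str:
    """Classify a requested package name as approved, suspicious, or unknown."""
    name = name.strip().lower()
    if name in APPROVED:
        return "approved"
    # A near miss against a vetted name is the classic typosquatting pattern.
    close = difflib.get_close_matches(name, APPROVED, n=1, cutoff=0.8)
    if close:
        return f"suspicious: did you mean {close[0]!r}?"
    return "unknown: review before installing"


print(check_package_name("pandas"))
print(check_package_name("reqeusts"))
print(check_package_name("some-new-library"))
```

A check like this only flags names that resemble something on the list, so it complements careful typing rather than replacing it.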

Set DEBUG = False in production

For data scientists, it makes sense to set DEBUG to false once the code is written and verified to work correctly. This is because the code could be reused by other team members who might deploy it in more public-facing settings without much thought.

According to Fischer, most frameworks have debugging enabled by default. So, make sure to switch off debugging in your favorite frameworks to prevent the accidental leaking of sensitive application information to attackers.
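One way to make "off" the default is to read the flag from the environment and treat anything other than an explicit opt-in as False. This is a minimal sketch under that assumption; the APP_DEBUG variable name and the debug_enabled() helper are hypothetical, and how the resulting value is passed on depends on the framework in use.

```python
import os


def debug_enabled(environ=os.environ) -> bool:
    """Return True only if APP_DEBUG is explicitly set to a truthy value."""
    return environ.get("APP_DEBUG", "").strip().lower() in {"1", "true", "yes"}


# Debugging stays off unless someone deliberately turns it on.
DEBUG = debug_enabled()
print(f"DEBUG = {DEBUG}")
```

With this arrangement, a colleague who reuses the code elsewhere gets debugging disabled unless they set the variable on purpose.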

Deserialize cautiously

One of the strengths of Python is that it infers the type of a variable without requiring an explicit declaration. This makes Python beginner-friendly, as does the ease of loading data from various sources.

But if you are considering using the pickle module to serialize or de-serialize a Python object structure, Fischer notes that the module is considered insecure and should only be used on trusted data sources. Use YAML instead, he suggests, and load it with PyYAML's SafeLoader rather than Loader (for example, through yaml.safe_load()).
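To see why pickle is treated as unsafe for untrusted input, the sketch below builds a payload whose __reduce__ method tells pickle to call a function of the author's choosing when the data is loaded. The Innocent class is a made-up example, and a harmless eval of "6 * 7" stands in for what an attacker would actually run. For contrast, the second half parses a data-only format; the standard library's json is used here so the example has no third-party dependencies, and PyYAML's yaml.safe_load() fills the same role for YAML.

```python
import json
import pickle


class Innocent:
    """Looks like plain data, but tells pickle which callable to run on load."""

    def __reduce__(self):
        # An attacker would put something destructive here; a harmless
        # arithmetic expression is used for the demonstration.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Innocent())

# Loading the bytes calls eval("6 * 7") instead of rebuilding an Innocent object.
result = pickle.loads(payload)
print(type(result).__name__, result)

# A data-only format can only yield basic types such as dict, list, str, and int.
config = json.loads('{"model": "baseline", "epochs": 5}')
print(config["epochs"])
```

The point of the contrast is that a data-only parser hands back values, while unpickling hands control to whoever produced the bytes.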

You can download the Python Security Best Practices Cheat Sheet 2021 here (pdf).

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: iStockphoto/gorodenkoff
