To succeed with analytics, organizations need to start their journey with a set of business questions that they want to answer with data, says Dean Samuels, a lead technologist at Amazon Web Services (AWS). These questions should then be matched to the available data sources that can provide those insights, before a decision is made on the best technology to ingest, store, process, and analyze the data.
Unfortunately, many organizations do this backwards, making technology-driven decisions rather than business-driven ones, he explained in response to a query by CDOTrends: “They may want to get on the technology bandwagon without focusing on what they want to achieve as a business.”
Start with meaningful data
With that out of the way, the actual data science journey begins when organizations start collecting meaningful data. On that front, Samuels suggests that businesses adopt a “working backwards” methodology when it comes to setting up their data infrastructure.
This typically entails focusing on the desired result and using it to figure out what needs to be done. “The approach ensures our teams are focused on our customers’ wants and needs. We encourage our own customers to take the same approach and work backwards from their end state,” Samuels said.
“They need to leverage a modern data platform. This is to ensure that their data infrastructure can scale and provides flexibility to provide the insights they need. Their data infrastructure should be able to ingest and store, process and analyze – with storage decoupled from compute, and visualize their data, all whilst being secure.”
Supporting citizen data scientists
As awareness of the value that citizen data scientists can bring increases, what are some ways that organizations can best support them? The key here is having the leeway for experimentation, according to Samuels. “Experimentation is key to innovation. Data scientists should try out new ideas by having a well-defined hypothesis, data, metrics to prove or disprove the hypothesis, and the ability to pivot to the next idea.”
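The experiment loop Samuels describes (hypothesis, data, a metric, then a keep-or-pivot call) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method from AWS or from the interview: the `Experiment` class, the permutation test used as the metric, the 0.05 threshold, and the sample numbers are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class Experiment:
    """Hypothetical container for one idea being tested."""
    hypothesis: str
    control: list[float]   # metric values without the change
    variant: list[float]   # metric values with the change

    def observed_lift(self) -> float:
        # The metric: difference in means between variant and control.
        return mean(self.variant) - mean(self.control)

    def p_value(self, rounds: int = 5000, seed: int = 0) -> float:
        # One-sided permutation test: how often does a random relabelling
        # of the pooled data produce a lift at least as large as observed?
        rng = random.Random(seed)
        pooled = self.control + self.variant
        n = len(self.control)
        observed = self.observed_lift()
        hits = 0
        for _ in range(rounds):
            rng.shuffle(pooled)
            if mean(pooled[n:]) - mean(pooled[:n]) >= observed:
                hits += 1
        return (hits + 1) / (rounds + 1)

    def decide(self, alpha: float = 0.05) -> str:
        # "keep" if the data supports the hypothesis, otherwise "pivot".
        return "keep" if self.p_value() < alpha else "pivot"


# Made-up numbers purely for illustration.
exp = Experiment(
    hypothesis="The new recommendation widget raises basket size",
    control=[10, 11, 9, 10, 12, 10, 11, 9],
    variant=[15, 16, 14, 15, 17, 15, 16, 14],
)
print(exp.hypothesis, "->", exp.decide())
```

Stating the hypothesis and the decision rule before looking at the data is what makes the pivot cheap: a "pivot" result closes the idea cleanly and frees the team to try the next one.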
With access to the right data tools and judicious automation, both citizen data scientists and machine learning experts can focus on activities and responsibilities that bring value and a competitive edge to their organization, says Samuels, without being stifled by “undifferentiated heavy lifting” such as manual and repeatable processes.
Samuels recommends that organizations adopt the 80/20 rule, with the bulk of the citizen data scientists’ time spent on tasks such as inventing, experimenting and bringing ideas to reality. Developers should ideally be able to incorporate machine learning without having to be specialists, while specialists can continue to deploy the advanced data science frameworks they need to get the job done.
“For example, at AWS, we aim to put machine learning in the hands of every developer and data scientist. What that means is we want to make it easy for developers to integrate machine learning into their applications without needing to be deep machine learning specialists, and allow deep machine learning specialists to have the ability to customize, choose and leverage the machine learning frameworks and infrastructure technologies they need to build, train and deploy their machine learning models.”
The time is now
Unsurprisingly, Samuels recommends the cloud for its flexibility and agility, as well as the ability to start small and scale up quickly when needed. He also recommends establishing data lakes in the cloud, where complex queries can be quickly performed on massive volumes of structured, semi-structured and unstructured data.
“They can also run the analytics they need to deliver the insights and guide better decisions. When done right, a data lake can open the door to a whole new set of advanced analytics, facilitating data science and machine learning,” he said.
But are there scenarios where an on-premises deployment might make more sense? “If companies choose to deploy on-premise infrastructure for their data science needs, it is normally due to unique needs such as low latency for near real-time processing, data residency and data sovereignty.”
Still undecided? Samuels summed up the top areas to potentially kick off your organization’s data science initiative: “There are three key areas for data-driven development: retrospective, for reporting and analyzing based on past behaviors; here and now, through real-time analysis and dashboarding; and using machine learning technologies to make predictions and decisions by creating smart applications.”
Photo credit: iStockphoto/Madmaxer