7 Emerging Open Source Big Data Projects That Will Revolutionize Business

Twenty years ago, the Open Source Definition was published, setting in motion what would become the most significant trend in software development since. Whether you call it "free software" or "open source," ultimately it is about making application and system source code widely available and placing the software under a license that favors user autonomy.

According to Ovum, open source is already the default option across several big data categories, from storage and analytics to applications and machine learning. In the latest survey from Black Duck Software and North Bridge, 90% of respondents reported relying on open source "for improved efficiency, innovation, and interoperability," most commonly citing "freedom from vendor lock-in; competitive features and technical capabilities; ability to customize; and overall quality." There are now thousands of successful open source projects, and firms must choose among them strategically to stay competitive.

While every firm must develop its strategy and choose the open source projects it feels will fuel its desired business outcomes, there are some projects that we feel are worth strong consideration.

Using Open Source for Business Agility

Following are a few of the big data open source projects with the greatest potential to give companies extreme agility and lightning-fast responses to customers, business needs, and market challenges.

  1. Apache Beam takes its name from the two big data processing modes it unifies: batch and streaming. It offers a single programming model for both cases. Under the Beam model, you design a data pipeline once and choose from multiple processing frameworks (runners) later. Your data pipeline is portable and flexible, so you can run it as a batch job or as a stream. This gives your team much greater agility and the flexibility to reuse data pipelines while choosing the right processing engine for each use case.
  2. Apache Airflow is ideal for automated, smart scheduling of Beam pipelines to optimize processes and organize projects. Among its other beneficial capabilities and features, pipelines are configured as code, which makes them dynamic, and the web interface visualizes Directed Acyclic Graph (DAG) and task instance status. If and when there is a failure, Airflow can rerun a DAG instance.
  3. Apache Cassandra is a scalable and nimble multi-master database that enables failed node replacement without shutting anything down, along with automatic data replication across multiple nodes. It is a NoSQL database with high availability and scalability. It differs from traditional RDBMSs and some other NoSQL databases in that it has no master-slave structure: all nodes are peers, and the cluster is fault tolerant. This makes it extremely easy to scale out for more computing power without any application downtime.
  4. Apache CarbonData is an indexed columnar data format for incredibly fast analytics on big data platforms such as Hadoop and Spark. This new kind of file format solves the problem of serving varied analytical query patterns from one store. With CarbonData, the data format is unified, so you can serve multiple workloads from a single copy of the data and use only the computing power needed, making your queries run much faster.
  5. Apache Spark is one of the most widely utilized Apache projects and a popular choice for incredibly fast big data processing (cluster computing) with built-in capabilities for real-time data streaming, SQL, machine learning, and graph processing. Spark is optimized to run in memory and enables interactive streaming analytics so you can analyze vast amounts of historical data with live data to make real-time decisions, such as fraud detection, predictive analytics, sentiment analysis and next-best offer.
  6. TensorFlow is an extremely popular open source library for machine intelligence that enables far more advanced analytics at scale. TensorFlow is designed for large-scale distributed training and inference but is also flexible enough to support experimentation with new machine learning models and system-level optimizations. Its code is readable and well documented, and its community continues to grow.
  7. Docker and Kubernetes are container and automated container management technologies that speed up the deployment of applications. Using technologies like containers makes your architecture extremely flexible and more portable. Your DevOps process will benefit from increased efficiencies in continuous deployment.
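Beam's "design once, run anywhere" idea from item 1 can be illustrated with a small, hypothetical pure-Python sketch (this is not the real `apache_beam` API): the pipeline is defined once as a sequence of transforms, then fed either a bounded batch or a lazy, stream-like source.

```python
# Sketch of Beam's core idea (not the real apache_beam API): define the
# pipeline of transforms once, then run it over a bounded batch or a
# lazy stream without changing the pipeline itself.
from typing import Callable, Iterable, Iterator, List

Transform = Callable[[Iterable[int]], Iterator[int]]

def build_pipeline() -> List[Transform]:
    """One pipeline definition, reusable for batch and streaming input."""
    return [
        lambda xs: (x * 2 for x in xs),        # Map: double each element
        lambda xs: (x for x in xs if x > 4),   # Filter: keep large values
    ]

def run(pipeline: List[Transform], source: Iterable[int]) -> List[int]:
    """A trivial 'runner': threads the source through every transform."""
    data: Iterable[int] = source
    for transform in pipeline:
        data = transform(data)
    return list(data)

pipeline = build_pipeline()
batch_result = run(pipeline, [1, 2, 3, 4])        # bounded, batch-style input
stream_result = run(pipeline, iter(range(1, 5)))  # lazy, stream-style input
print(batch_result, stream_result)  # → [6, 8] [6, 8]
```

In real Beam, the "runner" is a full engine such as Flink, Spark, or Dataflow, but the division of labor is the same: the pipeline describes *what* to compute, and the runner decides *how*.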
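The two properties highlighted for Airflow in item 2, pipelines as code organized into a Directed Acyclic Graph, and the ability to rerun a failed task, can be sketched with the standard library alone (this toy scheduler is illustrative and uses none of the real `airflow` API):

```python
# Toy illustration of what Airflow manages (not the airflow API): tasks form
# a Directed Acyclic Graph (DAG), run in dependency order, and a failed task
# can be retried without rerunning the whole graph.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# Dependency-respecting execution order.
order = list(TopologicalSorter(dag).static_order())

completed = []

def run_with_retry(task: str, max_tries: int = 2) -> None:
    """Run one task, retrying on failure instead of failing the whole DAG."""
    for attempt in range(1, max_tries + 1):
        try:
            if task == "load" and attempt == 1:
                raise RuntimeError("transient failure")  # simulated flaky task
            completed.append(task)
            return
        except RuntimeError:
            continue  # Airflow similarly re-runs just this task instance

for task in order:
    run_with_retry(task)

print(order)      # → ['extract', 'transform', 'load', 'report']
print(completed)  # every task eventually succeeded, in order
```

Real Airflow adds scheduling, backfills, and a UI on top, but the underlying contract is this one: code declares the graph, and the scheduler walks it.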
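Cassandra's masterless design from item 3 rests on placing each key on several peer nodes around a hash ring. The following is a minimal sketch of that placement idea under simplified assumptions (it is not Cassandra's actual partitioner, and the node names are invented):

```python
# Minimal sketch of masterless data placement (not Cassandra's real code):
# every node is an equal peer on a hash ring, and each key is stored on N
# successive nodes, so losing any one node leaves replicas elsewhere.
import hashlib
from bisect import bisect

def token(value: str) -> int:
    """Stable hash so every peer computes the same placement independently."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical peers
ring = sorted((token(n), n) for n in nodes)        # positions on the ring

def replicas(key: str, replication_factor: int = 3) -> list:
    """Walk clockwise from the key's ring position, collecting peers."""
    start = bisect([t for t, _ in ring], token(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]

owners = replicas("user:42")
print(owners)  # three distinct peers hold this key; no single master
```

Because any peer can compute `replicas()` for any key, there is no coordinator to fail, which is what makes scaling out and replacing dead nodes possible without downtime.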

As impressive as each of these open source projects is individually, it is the collective advances that best illustrate the huge impact the open source community has had on the enterprise, and the monumental shift from legacy and proprietary software to open source-based systems. Essentially, it enables firms of all sizes, across all industries, to increase speed, agility, and data-driven insights at all organizational levels.

Preparing for Upcoming Open Source System Changes

While the changes that have already occurred are quite breathtaking, this is not the end of the story. There are several ways firms can leverage the sea change that has already occurred and adapt to the innovations yet to come from the mashup of open source, cloud, and big data.

  1. Become an open source champion in your business. Join the open source communities relevant to your projects and interests. Educate yourself, your team, and management on the benefits. Determine what you can leverage instead of "reinventing the wheel."
  2. Contribute to open source projects. Many firms use open source today, but unfortunately, many of them do not contribute. By contributing upstream, not only do others benefit from your work; your firm also benefits from theirs. It means more feedback, new features, and a greater chance that issues get fixed.
  3. Become an influencer in the open source projects key to your firm. By contributing, your firm develops influence in the community around the projects important to its progress. That influence helps you steer changes that will particularly benefit your own projects.
  4. Change the business culture to open source. The open source culture is open-minded, innovative, and collaborative. Embracing transparency helps teams accept feedback gracefully, stay open-minded, and welcome change.

Change has always been the only constant in human existence and business. But change is happening faster now than at any other time in history. By staying open-minded, attuned to open source, and aware of the many ways to use data and analytics, you’ll be well prepared for whatever pops up next on the horizon.

This contributed article is authored by Jason Bissell, General Manager of Asia Pacific and Japan and Calvin Hoon, Regional VP of Sales, Asia Pacific at Talend. The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends.