Where do you begin with Big Data in the enterprise? The number of available tools is staggering, and if your company hasn't had to deal with Big Data before, it's all too easy to get the wrong tool for the job. But as with most enterprise software problems, a bit of perspective on what Big Data's parts are, and on the tools for taking care of each part, can serve as a useful introduction.
We all know what data is, and people are generating a massive amount of it every day. Big Data came about because companies that started collecting data quickly learned that traditional storage methods could not store or process all of it. So data scientists came up with new ways to do just that. But if you're completely new to Big Data, the first place to start isn't with software. It's with education. Take a free course on Big Data and Hadoop, such as the one on Udemy, then come back to this article.
Big Data Tools
Big Data tools come in several large categories:
Data collection (sometimes called data extraction): Getting the data in the first place, e.g. from a webpage.
Data storage: A method to store the data in Big Data-compatible systems such as Hadoop and NoSQL databases.
Data cleaning: Any data set will need to be checked for bad entries. Data cleaning tools help organise and sort your data so that when you start analysis you will get accurate answers.
Data mining and analysis: With a clean database, these tools look through the data for hidden patterns and then project those patterns into the future so your enterprise can make business decisions based on that data.
Data visualisation: These help break the spreadsheets and data outputs from the previous tools into attractive forms to help people grasp what the tools have discovered.
Data integration: These tools integrate all the other tools above into one package, or at least help them to communicate with one another.
For enterprises that are new to Big Data, it's probably not necessary to send a bunch of programmers out to learn Big Data programming languages so you can write your own tools. There are plenty of open-source and commercial options available that can help you build a Big Data technology suite step-by-step. Some software companies provide packages that overlap in these areas. We recommend studying these options and others carefully to choose the best suite for your enterprise.
For enterprises that are looking at all-in-one solutions to handle Big Data, there's always Oracle. Oracle may be the best choice if your data warehouse already uses Oracle software and you have Oracle DBAs on staff. But if you're not locked into Oracle, you may want to consider Talend and Splunk as alternatives. These all-in-one tools have a lot of upsides, but they do have one big downside: they're quite expensive.
Your collection software needs to be able to grab the information sources that you want to study and put them in a format that your analytics tools can use. UiPath, Screen Scraper, and Import.io are each well regarded for collecting information from webpages. UiPath is the best one for enterprise-ready use, but if you have a savvy programmer then Screen Scraper could provide more power. Import.io is a good alternative to either.
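To make "put them in a format that your analytics tools can use" a little more concrete, here is a minimal sketch of what a collection step does under the hood. It is not how UiPath, Screen Scraper, or Import.io work internally; it simply assumes the page contains a plain HTML table and uses only Python's standard library. The names TableExtractor, html_table_to_csv, and SAMPLE_PAGE are made up for this illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Turn the table rows found in an HTML string into CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical page content; a real collector would download this first.
SAMPLE_PAGE = """
<table>
  <tr><th>product</th><th>price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

csv_text = html_table_to_csv(SAMPLE_PAGE)
print(csv_text)
```

The output is ordinary CSV, which almost every analytics and cleaning tool can import. Commercial collectors add the hard parts this sketch skips, such as logins, pagination, JavaScript-rendered pages, and scheduling.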
Paradoxically, your Big Data database solution might be the last thing you need to implement. It really depends on how much data you have and the types of data you need stored. Your database administrators should know about systems like Hadoop, MongoDB, and other NoSQL and document-oriented database types, but you don't need to have enough data to require a Big Data storage system before you can get the rest of your infrastructure in place. Another alternative is to use Google, AWS, or Microsoft cloud systems and let them handle all the storage while you work on the other tools, though depending on your industry this may not be the best option.
No Big Data tool will work if your data is messy, and that's where dedicated cleaning tools come in. The two to check out are OpenRefine and DataCleaner. The first one is a quality open-source product (it was formerly known as Google Refine) and the second is a commercial product. Both will do the job, though.
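For a feel of what "checking for bad entries" means in practice, here is a small sketch in plain Python. It is not how OpenRefine or DataCleaner are implemented; the record layout (an email and an age field), the validity rules, and the function name clean_records are all assumptions chosen for the example.

```python
def clean_records(records):
    """Split raw records into (clean, rejected) lists.

    Assumes each record is a dict with 'email' and 'age' keys.
    """
    clean, rejected, seen = [], [], set()
    for rec in records:
        # Normalise: trim whitespace and lower-case the email.
        email = (rec.get("email") or "").strip().lower()
        age_raw = str(rec.get("age", "")).strip()

        # Reject rows with a malformed email or a non-numeric age.
        if "@" not in email or not age_raw.isdigit():
            rejected.append(rec)
            continue

        age = int(age_raw)
        # Reject implausible ages and duplicate emails.
        if not 0 < age < 120 or email in seen:
            rejected.append(rec)
            continue

        seen.add(email)
        clean.append({"email": email, "age": age})
    return clean, rejected


raw = [
    {"email": " Ann@Example.com ", "age": "34"},
    {"email": "ann@example.com", "age": "34"},     # duplicate once normalised
    {"email": "bob-at-example.com", "age": "41"},  # malformed email
    {"email": "cy@example.com", "age": "n/a"},     # non-numeric age
    {"email": "di@example.com", "age": 29},
]

clean, rejected = clean_records(raw)
print(len(clean), "clean,", len(rejected), "rejected")
```

Keeping the rejected rows instead of silently dropping them is a deliberate choice here: someone usually needs to review why entries failed before the rules can be trusted.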
Mining and Analysis
The tools for this category fall across a wide spectrum, and could prove the trickiest of solutions to find. If you've ever heard the term predictive intelligence, that's what these tools are meant to support. They dig through the data for patterns and then make predictions based on those patterns.
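The simplest possible version of "find a pattern, then project it forward" is a straight-line trend. The sketch below fits a least-squares line to a short series and extends it; the commercial tools in this category use far richer models, and the function names and sample figures here are invented for illustration.

```python
def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def project(ys, periods_ahead):
    """Extend the fitted trend line for the next `periods_ahead` periods."""
    slope, intercept = fit_line(ys)
    n = len(ys)
    return [slope * (n + k) + intercept for k in range(periods_ahead)]


# Hypothetical monthly order counts.
monthly_orders = [100, 110, 120, 130, 140]
forecast = project(monthly_orders, 2)
print(forecast)
```

A straight line only captures steady growth or decline. Seasonality, sudden shifts, and relationships between many variables are where dedicated mining and analysis tools earn their keep.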
If you are using one of the big three cloud providers, start with Qubole. Their tools are enterprise-ready and are specially designed to link into these providers. Oracle also sells solutions specifically for data analysis, as does IBM. Another solution that is enterprise-trusted is Teradata, used by companies like Wells Fargo and Siemens.
There is also a huge variety of specialty collection tools, like CONCURED itself. CONCURED scans the internet to find out what people are talking about so content marketers can predict what to write about next. If you want to reach out to the data scientist community and offer them your particularly thorny problem, check out Kaggle.
Many of the above packages will have some level of data visualisation built in, but sometimes a specific visualisation tool is needed. Real-time visualisation or specialty data formats like maps are situations where you may need a dedicated tool. One example of a visualisation suite is Tableau.
Building a Big Data suite is a step-by-step process, but the tools for handling large data sets are getting easier to use over time. If your enterprise needs to turn to Big Data for the next leap in its evolution, take some time to look through the tools in this article and start your research process now.