As this is the first article on this blog related to Hadoop, it starts from the basics. Today, Big Data seems to be the most buzzed-about term around. We can describe it simply as a huge amount of data that holds information – information that, at the same time, is not easy to extract. Of course, Big Data is not quite as simple as this definition, but since this article is an introduction to Hadoop, we can leave a fuller explanation of Big Data for now. Let’s talk about the problems associated with it.
There are two major problems associated with Big Data.
- The first problem is storing such a huge amount of data. We need storage for it that also provides a mechanism to avoid data loss.
- The main purpose of storing this huge amount of data is to extract information from it. Hence, the second problem is analyzing the data.
Hadoop can be described as a technology that deals with these two basic Big Data problems. Accordingly, Hadoop itself consists of two different concepts:
- HDFS (Hadoop Distributed File System) – used for reliably storing massive amounts of data
- MapReduce – for analyzing the data stored in HDFS
The next section of this article provides an introduction to these two concepts of Hadoop. Upcoming articles will try to explain them in detail. Please post your queries in the comment section and we will try to come up with related articles.
HDFS stands for Hadoop Distributed File System. Since we have a huge amount of data, it cannot fit in the storage provided by a single physical machine. We need to distribute the data so that the files can be shared during job processing, and this is where a distributed file system (DFS) comes to the rescue. A distributed file system provides storage across several machines connected to each other over a network. It is not mandatory to use only HDFS with Hadoop; however, HDFS is Hadoop’s flagship file system. GPFS (General Parallel File System), a file system provided by IBM, for example, can also be integrated with Hadoop. IBM’s InfoSphere BigInsights provides the ability to configure Hadoop with either of these file systems.
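To make the idea of distributed storage concrete, here is a minimal sketch in plain Python of how a DFS like HDFS splits a file into fixed-size blocks and replicates each block across several machines. The block size, machine names, and round-robin placement below are illustrative assumptions for this toy model, not HDFS's actual values or placement policy (HDFS defaults to much larger blocks, e.g. 128 MB, and a replication factor of 3).

```python
BLOCK_SIZE = 16   # bytes per block (toy value; HDFS blocks are far larger)
REPLICATION = 3   # copies kept of each block to avoid data loss
machines = ["node1", "node2", "node3", "node4"]  # hypothetical cluster nodes

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split raw bytes into fixed-size blocks (the last may be smaller)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, machines, replication=REPLICATION):
    """Assign each block to `replication` machines, round-robin."""
    placement = {}
    for i, _block in enumerate(blocks):
        placement[i] = [machines[(i + r) % len(machines)]
                        for r in range(replication)]
    return placement

data = b"a" * 40                      # a 40-byte "file"
blocks = split_into_blocks(data)      # 3 blocks: 16 + 16 + 8 bytes
placement = place_blocks(blocks, machines)
print(len(blocks))                    # → 3
```

Because every block lives on several machines, losing any single machine still leaves copies of all the data – which is how a DFS turns many unreliable machines into reliable storage.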
Now that we have a distributed file system to store the data, the second problem is to process it, since that is the main reason we are storing this huge amount of data in the first place. To explain it in an easier manner, let’s assume we have a very big task to complete – counting the words in a book, for example. What is the easiest way to complete it in less time? The answer is to divide the problem into sub-problems. So we’ll divide the book into several sets of pages and first count the words in each set. Later, we can combine the outcomes into one final result. This is exactly the approach MapReduce follows. Hence, MapReduce can be defined as a programming model used for processing data. The processing is divided into two parts: map and reduce. The functions for map and reduce are defined by the programmer.
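The word-count example above can be sketched in plain Python to show the shape of the two phases. This is a conceptual illustration, not Hadoop's actual API (real Hadoop MapReduce jobs are typically written in Java against the `Mapper` and `Reducer` classes); the function names and sample "pages" here are made up for the example.

```python
from collections import defaultdict

def map_phase(pages):
    """Map: for each 'page' (a chunk of the book), emit (word, 1) pairs.
    In a real cluster, each page could be processed on a different machine."""
    for page in pages:
        for word in page.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: combine the emitted pairs by summing the counts per word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# The "book", divided into sets of pages.
pages = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = reduce_phase(map_phase(pages))
print(word_counts["the"])  # → 3 ("the" appears once on each page)
```

Note that the map step never needs to see the whole book at once – that independence is what lets the work be spread across many machines before the reduce step merges the partial results.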