Monday, May 16, 2016

Mining Massive Datasets - Distributed File Systems


Distributed File Systems

In the last few years Map-Reduce has emerged as a leading paradigm for mining really massive data sets. But before getting into Map-Reduce, let's try to understand why we need Map-Reduce in the first place. In the basic computational model of CPU and memory, the algorithm runs on the CPU, and accesses data that's in memory.
Now we may need to bring the data in from disk into memory, but once the data is in memory it fits there fully, so there is no need to touch the disk again; the algorithm just runs on the data in memory. This is the familiar model we use to implement all kinds of algorithms in machine learning and statistics. However, what happens when the data is so big that it can't all fit in memory at the same time?
In that case we can only bring a portion of the data into memory at a time, process it in batches, and write the results back to disk. This is the realm of classical data mining algorithms. But sometimes even this is not sufficient. Example: Google crawling and indexing the web.
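To make the batch idea concrete, here is a minimal Python sketch of processing a file that is too large for memory in fixed-size chunks; the file name, the chunk size, and the byte-counting "work" are made up purely for illustration.

    # Minimal sketch of batch-style processing: stream a file that does not fit
    # in memory in fixed-size chunks, compute a partial result per chunk, and
    # combine the partial results. The file name and chunk size are hypothetical.

    CHUNK_SIZE = 64 * 1024 * 1024  # bring 64 MB into memory per batch

    def count_bytes_in_batches(path):
        total = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)   # read one batch from disk
                if not chunk:                # end of file
                    break
                total += len(chunk)          # "process" the batch in memory
        return total

    if __name__ == "__main__":
        print(count_bytes_in_batches("crawl.dat"))  # hypothetical input file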
Let's say Google has crawled 10 billion web pages and the average size of a web page is 20 KB; these are representative numbers from real life. Ten billion web pages at 20 KB each gives a total data set size of 200 TB. Now assume the classical computational model: all of this data is stored on a single disk and has to be read into a single CPU to be processed. The fundamental limitation here is the data bandwidth between the disk and the CPU. The data has to be read from the disk into the CPU, and the read bandwidth of a typical modern SATA disk is around 50 MB a second. So we can read data at 50 MB a second.

How long does it take to read 200 TB at 50 MB a second?
The answer is 4 million seconds, which is more than 46 days. This is an awfully long time, and it is just the time to read the data into memory; to do something useful with the data will take even longer. So clearly this is unacceptable. Forty-six days just to read the data is not workable.
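The arithmetic is simple enough to check in a few lines of Python; this is just the calculation above, nothing more.

    pages = 10 * 10**9           # 10 billion web pages
    page_size = 20 * 10**3       # 20 KB per page, in bytes
    bandwidth = 50 * 10**6       # 50 MB/s disk read bandwidth, in bytes/s

    total_bytes = pages * page_size        # 2e14 bytes = 200 TB
    seconds = total_bytes / bandwidth      # 4,000,000 seconds
    print(seconds / 86400)                 # roughly 46.3 days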
Solution:
Now the obvious thing to do is to split the data into chunks across multiple disks and CPUs: stripe the data across multiple disks, then read and process it in parallel on multiple CPUs. This cuts the time down by a lot.

For example, with 1,000 disks and CPUs, the job can be done in 4 million divided by 1,000, which is 4,000 seconds. That's just about an hour, which is a very acceptable time.
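Extending the little calculation above, here is the same arithmetic with the disk count as a parameter; this assumes perfectly even striping and no coordination overhead, which real clusters never quite achieve.

    def read_time_seconds(total_bytes, bandwidth_per_disk, num_disks):
        # idealized: total bandwidth scales linearly with the number of disks
        return total_bytes / (bandwidth_per_disk * num_disks)

    total_bytes = 200 * 10**12   # 200 TB
    bandwidth = 50 * 10**6       # 50 MB/s per disk

    for disks in (1, 100, 1000):
        t = read_time_seconds(total_bytes, bandwidth, disks)
        print(disks, round(t), "seconds")   # 1 -> 4,000,000 s; 1,000 -> 4,000 s (~1.1 hours)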

So this is the fundamental idea behind cluster computing. The tiered architecture that has emerged for cluster computing looks like this: the basic building block is a rack of commodity Linux nodes (commodity Linux nodes because they are very cheap), and you need thousands and thousands of them. Each rack holds 16 to 64 of these nodes, connected by a switch; a gigabit switch gives 1 Gbps bandwidth between any pair of nodes in the rack. Of course 16 to 64 nodes is not sufficient, so multiple racks are connected by backbone switches, which are higher-bandwidth switches that can do 2 to 10 Gbps between racks. So we have 16 to 64 nodes in a rack, and multiple racks together form a data center. This is the standard, classical architecture that has emerged over the last few years for storing and mining very large data sets.
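To get a feel for what these link speeds mean in practice, here is a small sketch that just converts them into transfer times; the 100 GB payload is a made-up example, and the 1 Gbps and 10 Gbps figures are the ones quoted above.

    GBIT = 10**9 / 8                      # 1 gigabit per second, in bytes per second

    def transfer_seconds(num_bytes, gbps):
        # time to push num_bytes over a single link of the given speed
        return num_bytes / (gbps * GBIT)

    payload = 100 * 10**9                 # 100 GB to move, hypothetical

    print("1 Gbps in-rack link :", round(transfer_seconds(payload, 1)), "s")    # ~800 s
    print("10 Gbps backbone    :", round(transfer_seconds(payload, 10)), "s")   # ~80 s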
