Full-Service Internet Marketing & Web Development | WebMail
Call us Toll-Free:
1-800-218-1525
Email us

Distributed Storage and the Google File System (GFS)

Mike Peters, 10-10-2007
The MySQL Database, or any relational database for that matter, typically perform great in an environment of "many reads / few writes".

You should always work on reducing the number of database writes, and remember read is a lot cheaper than write.

Most web services out there, fit perfectly into the "many reads / few writes" environment. You have tons of end-users all trying to access a (relatively) small subset of data.

A few examples include:

WordPress Blog - Infinite number of users fetching the blog posts to their browseres. Relatively low number of posts at any given point in time.

BlogRush - Infinite number of users have the widget installed on their blog. But at any given point in time, there is always a fixed small number of categories and posts to be served back.

AuctionAds - Infinite number of users have the widget installed on their site. Fixed small number of auctions are served back.

--

If you can fit your web service into this environment of "many reads / few writes" - you're in luck. It is significantly easier to optimize the performance / scalability of such services and there lots of ready-made tools for this job.

On the other hand, if your web service requires "many writes / few reads", it may be time to question your underlying storage platform.

Search Engines (like Google) and Real-time Web stats are two examples of an environment where you have many concurrent writes with a much smaller number of reads.

A search engine needs to crawl and parse (write) data about billions of pages on a daily basis, to serve a relatively small number of end-user requests.

A relational database is typically not the best choice for such an environment.

Relational databases carry an overhead involved with maintaining data integrity, avoiding duplicate records, locking tables/rows and running off a single vector processing unit. You could do replication, load balancing and switch to Oracle, but if you cannot switch your web service to a "many reads / few writes" environment, you're probably better off looking at an alternative storage platform to replace your relational database.

Google recently released a series of video tutorials and whitepapers, explaining the MapReduce algorithm they use as part of Google's Distributed File System (GFS).

Here's a link to view Google's Cluster Computing and MapReduce class.

If you don't feel like watching the 3 hour videos, here's a brief summary:

* Processing more data is going to mean using more machines at the same time, that's because Moore's law has maxed out and no longer applies (memory bandwidth issue)

* Anytime we're going to cooperate between multiple processes we need to use synchornization primitives which are problematic (deadlocks)

* Real distributed system requires consideration of networking topology and recovery when individual units fail - they will fail, by definition

* NFS - Each machine is responsible to a single file. Stateless protocol

* GFS - Master keeps meta data about what chunk servers hold which data. Data is replicated across multiple chunk servers. Files are always appended to, no random access writes. Reads are streamed. All master data is kept in memory with 64kb per chunk

* Logs are used to keep the master data in tact - Write log, then write meta data with an END token. If END not there, need to rewrite.

* Fault Tolerance - High Availablity, Fast recovery, Chunk replication (default: 3 replicas), Shadow masters. Data Integrity - Checksum every 64Kb block in each chunk

--

To summarize, relational databases are great. But they're not the end-all-be-all tool. Sometimes dumping raw data to a flat text file makes a lot more sense.

Mike Peters, 10-15-2007
To add to this article -

While using Google's GFS (proprietary) or implementing a similar mechanism from scratch (good luck with that) may be beyond your reach, Amazon S3 has been around for two years and is widely adopted.

I urge you to explore Amazon's file storage platform as a possible solution for your "many writes / few reads" challenges.

Dawn Rossi, 01-09-2008
Another thing to look at is the new MySQL NDB Cluster storage that enables clustering of in-memory databases.

It was designed to work with very inexpensive hardware, and with a minimum of specific requirements for hardware or software.

Much like Google's File System
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Enjoyed this post?

Subscribe Now to receive new posts via Email as soon as they come out.

Post your comments












Note: No link spamming! If your message contains link/s, it will NOT be published on the site before manually approved by one of our moderators.


About us  |  Contact us  |  Privacy  |  Terms & Conditions  |  Affiliates

© 2008 Software Projects Inc. (SPI)
Friday, August 29th, 2008
Page generated in 0.533 seconds