The Architecture of Open Source Application

The Hadoop Distributed File System (HDFS) is designed to store verylarge data sets reliably, and to stream those data sets at highbandwidth to user applications. In a large cluster, thousands ofservers both host directly attached storage and execute userapplication tasks. By distributing storage and computation across manyservers, the resource can grow with demand while remaining economicalat every size. We describe the architecture of HDFS and report onexperience using HDFS to manage 40 petabytes of enterprise data atYahoo!

8.1. Introduction

Hadoop1 provides a distributedfilesystem and a framework for the analysis and transformation ofvery large data sets using the MapReduce [DG04]paradigm. While the interface to HDFS is patterned after the Unixfilesystem, faithfulness to standards was sacrificed in favor ofimproved performance for the applications at hand.

An important characteristic of Hadoop is the partitioning of data andcomputation across many (thousands) of hosts, and the execution ofapplication computations in parallel close to their data. A Hadoopcluster scales computation capacity, storage capacity and I/Obandwidth by simply adding commodity servers. Hadoop clusters atYahoo! span 40,000 servers, and store 40 petabytes of applicationdata, with the largest cluster being 4000 servers. One hundred otherorganizations worldwide report using Hadoop.

HDFS stores filesystem metadata and application data separately. Asin other distributed filesystems, like PVFS [CIRT00],Lustre2,and GFS [GGL03],HDFS stores metadata on adedicated server, called the NameNode. Application data are stored onother servers called DataNodes. All servers are fully connected andcommunicate with each other using TCP-based protocols. Unlike Lustreand PVFS, the DataNodes in HDFS do not rely on data protection mechanismssuch as RAID to make the data durable. Instead, like GFS, the filecontent is replicated on multiple DataNodes for reliability. Whileensuring data durability, this strategy has the added advantage thatdata transfer bandwidth is multiplied, and there are moreopportunities for locating computation near the needed data.

8.2. Architecture

8.2.1. NameNode

The HDFS namespace is a hierarchy of files and directories. Files anddirectories are represented on the NameNode by inodes. Inodes recordattributes like permissions, modification and access times, namespaceand disk space quotas. The file content is split into large blocks(typically 128 megabytes, but user selectable file-by-file), and eachblock of the file is independently replicated at multiple DataNodes(typically three, but user selectable file-by-file). The NameNodemaintains the namespace tree and the mapping of blocks to DataNodes. The current design has a singleNameNode for each cluster. The cluster can have thousands of DataNodesand tens of thousands of HDFS clients per cluster, as each DataNodemay execute multiple application tasks concurrently.

8.2.2. Image and Journal

The inodes and the list of blocks that define the metadata of the namesystem are called the image. NameNode keeps the entire namespace imagein RAM. The persistent record of the image stored in the NameNode'slocal native filesystem is called a checkpoint. The NameNode recordschanges to HDFS in a write-ahead log called the journal inits local native filesystem. The location of block replicas are notpart of the persistent checkpoint.

Each client-initiated transaction is recorded in the journal, and thejournal file is flushed and synced before the acknowledgment is sentto the client. The checkpoint file is never changed by the NameNode;a new file is written when a checkpoint is created duringrestart, when requested by the administrator, or by the CheckpointNodedescribed in the next section. During startup the NameNode initializesthe namespace image from the checkpoint, and then replays changes fromthe journal. A new checkpoint and an empty journal are written back tothe storage directories before the NameNode starts serving clients.

For improved durability, redundant copies of the checkpoint andjournal are typically stored on multiple independent local volumes andat remote NFS servers. The first choice prevents loss from a singlevolume failure, and the second choice protects against failure of theentire node. If the NameNode encounters an error writing the journalto one of the storage directories it automatically excludes thatdirectory from the list of storage directories. The NameNodeautomatically shuts itself down if no storage directory is available.

The NameNode is a multithreaded system and processes requestssimultaneously from multiple clients. Saving a transaction to diskbecomes a bottleneck since all other threads need to wait until thesynchronous flush-and-sync procedure initiated by one of them iscomplete. In order to optimize this process, the NameNode batchesmultiple transactions. When one of the NameNode's threads initiates aflush-and-sync operation, all the transactions batched at that timeare committed together. Remaining threads only need to check thattheir transactions have been saved and do not need to initiate aflush-and-sync operation.

8.2.3. DataNodes

Each block replica on a DataNode is represented by two files in thelocal native filesystem. The first file contains the data itself andthe second file records the block's metadata including checksums forthe data and the generation stamp. The size of the data file equalsthe actual length of the block and does not require extra space toround it up to the nominal block size as in traditionalfilesystems. Thus, if a block is half full it needs only half of thespace of the full block on the local drive.

During startup each DataNode connects to the NameNode and performs ahandshake. The purpose of the handshake is to verify the namespace IDand the software version of the DataNode. If either does not matchthat of the NameNode, the DataNode automatically shuts down.

The namespace ID is assigned to the filesystem instance when it isformatted. The namespace ID is persistently stored on all nodes of thecluster. Nodes with a different namespace ID will not be able to jointhe cluster, thus protecting the integrity of the filesystem. ADataNode that is newly initialized and without any namespace ID ispermitted to join the cluster and receive the cluster's namespace ID.

After the handshake the DataNode registers with theNameNode. DataNodes persistently store their unique storage IDs. Thestorage ID is an internal identifier of the DataNode, which makes itrecognizable even if it is restarted with a different IP address orport. The storage ID is assigned to the DataNode when it registerswith the NameNode for the first time and never changes after that.

A DataNode identifies block replicas in its possession to the NameNodeby sending a block report. A block report contains the block ID, thegeneration stamp and the length for each block replica the serverhosts. The first block report is sent immediately after the DataNoderegistration. Subsequent block reports are sent every hour and providethe NameNode with an up-to-date view of where block replicas arelocated on the cluster.

During normal operation DataNodes send heartbeats to the NameNode toconfirm that the DataNode is operating and the block replicas it hostsare available. The default heartbeat interval is three seconds. If theNameNode does not receive a heartbeat from a DataNode in ten minutesthe NameNode considers the DataNode to be out of service and the blockreplicas hosted by that DataNode to be unavailable. The NameNode thenschedules creation of new replicas of those blocks on other DataNodes.

Heartbeats from a DataNode also carry information about total storagecapacity, fraction of storage in use, and the number of data transferscurrently in progress. These statistics are used for the NameNode'sblock allocation and load balancing decisions.

The NameNode does not directly send requests to DataNodes. It usesreplies to heartbeats to send instructions to the DataNodes. Theinstructions include commands to replicate blocks to other nodes,remove local block replicas, re-register and send an immediate blockreport, and shut down the node.

These commands are important for maintaining the overall systemintegrity and therefore it is critical to keep heartbeats frequenteven on big clusters. The NameNode can process thousands of heartbeatsper second without affecting other NameNode operations.

8.2.4. HDFS Client

User applications access the filesystem using the HDFS client, alibrary that exports the HDFS filesystem interface.

Like most conventional filesystems, HDFS supports operations to read,write and delete files, and operations to create and deletedirectories. The user references files and directories by paths in thenamespace. The user application does not need to know that filesystemmetadata and storage are on different servers, or that blocks havemultiple replicas.

When an application reads a file, the HDFS client first asks theNameNode for the list of DataNodes that host replicas of the blocks ofthe file. The list is sorted by the network topology distance from the client. The clientcontacts a DataNode directly and requests thetransfer of the desired block. When a client writes, it first asks theNameNode to choose DataNodes to host replicas of the first block ofthe file. The client organizes a pipeline from node-to-node and sendsthe data. When the first block is filled, the client requests newDataNodes to be chosen to host replicas of the next block. A newpipeline is organized, and the client sends the further bytes of thefile. Choice of DataNodes for each block is likely to bedifferent. The interactions among the client, the NameNode and theDataNodes are illustrated in Figure 8.1.

[HDFS Client Creates a New File]

Figure 8.1: HDFS Client Creates a New File

Unlike conventional filesystems, HDFS provides an API that exposesthe locations of a file blocks. This allows applications like theMapReduce framework to schedule a task to where the data are located,thus improving the read performance. It also allows an application toset the replication factor of a file. By default a file's replicationfactor is three. For critical files or files which are accessed veryoften, having a higher replication factor improves tolerance againstfaults and increases read bandwidth.

8.2.5. CheckpointNode

The NameNode in HDFS, in addition to its primary role serving clientrequests, can alternatively execute either of two other roles, eithera CheckpointNode or a BackupNode. The role is specified at the nodestartup.

The CheckpointNode periodically combines the existing checkpoint andjournal to create a new checkpoint and an empty journal. TheCheckpointNode usually runs on a different host from the NameNodesince it has the same memory requirements as the NameNode. Itdownloads the current checkpoint and journal files from the NameNode,merges them locally, and returns the new checkpoint back to theNameNode.

Creating periodic checkpoints is one way to protect the filesystemmetadata. The system can start from the most recent checkpoint if allother persistent copies of the namespace image or journal areunavailable. Creating a checkpoint also lets the NameNode truncate thejournal when the new checkpoint is uploaded to the NameNode. HDFSclusters run for prolonged periods of time without restarts duringwhich the journal constantly grows. If the journal grows very large,the probability of loss or corruption of the journal fileincreases. Also, a very large journal extends the time required torestart the NameNode. For a large cluster, it takes an hour to processa week-long journal. Good practice is to create a daily checkpoint.

8.2.6. BackupNode

A recently introduced feature of HDFS is the BackupNode. Like aCheckpointNode, the BackupNode is capable of creating periodiccheckpoints, but in addition it maintains an in-memory, up-to-dateimage of the filesystem namespace that is always synchronized withthe state of the NameNode.

The BackupNode accepts the journal stream of namespace transactionsfrom the active NameNode, saves them in journal on its own storagedirectories, and applies these transactions to its own namespace imagein memory. The NameNode treats the BackupNode as a journal store thesame way as it treats journal files in its storage directories. If theNameNode fails, the BackupNode's image in memory and the checkpoint ondisk is a record of the latest namespace state.

The BackupNode can create a checkpoint without downloading checkpointand journal files from the active NameNode, since it already has anup-to-date namespace image in its memory. This makes the checkpointprocess on the BackupNode more efficient as it only needs to save thenamespace into its local storage directories.

The BackupNode can be viewed as a read-only NameNode. It contains allfilesystem metadata information except for block locations. It canperform all operations of the regular NameNode that do not involvemodification of the namespace or knowledge of block locations. Use ofa BackupNode provides the option of running the NameNode withoutpersistent storage, delegating responsibility of persisting thenamespace state to the BackupNode.

8.2.7. Upgrades and Filesystem Snapshots

During software upgrades the possibility of corrupting the filesystemdue to software bugs or human mistakes increases. The purpose ofcreating snapshots in HDFS is to minimize potential damage to the datastored in the system during upgrades.

The snapshot mechanism lets administrators persistently save thecurrent state of the filesystem, so that if the upgrade results indata loss or corruption it is possible to rollback the upgrade andreturn HDFS to the namespace and storage state as they were at thetime of the snapshot.

The snapshot (only one can exist) is created at the clusteradministrator's option whenever the system is started. If a snapshotis requested, the NameNode first reads the checkpoint and journalfiles and merges them in memory. Then it writes the new checkpoint andthe empty journal to a new location, so that the old checkpoint andjournal remain unchanged.

During handshake the NameNode instructs DataNodes whether to create alocal snapshot. The local snapshot on the DataNode cannot be createdby replicating the directories containing the data files as this would requiredoubling the storage capacity of every DataNode on thecluster. Instead each DataNode creates a copy of the storage directoryand hard links existing block files into it. When the DataNode removesa block it removes only the hard link, and block modifications duringappends use the copy-on-write technique. Thus old block replicasremain untouched in their old directories.

The cluster administrator can choose to roll back HDFS to the snapshotstate when restarting the system. The NameNode recovers the checkpointsaved when the snapshot was created. DataNodes restore the previouslyrenamed directories and initiate a background process to delete blockreplicas created after the snapshot was made. Having chosen to rollback, there is no provision to roll forward. The cluster administratorcan recover the storage occupied by the snapshot by commanding thesystem to abandon the snapshot; for snapshots created during upgrade,this finalizes the software upgrade.

System evolution may lead to a change in the format of the NameNode'scheckpoint and journal files, or in the data representation of blockreplica files on DataNodes. The layout version identifies the datarepresentation formats, and is persistently stored in the NameNode'sand the DataNodes' storage directories. During startup each nodecompares the layout version of the current software with the versionstored in its storage directories and automatically converts data fromolder formats to the newer ones. The conversion requires the mandatorycreation of a snapshot when the system restarts with the new softwarelayout version.

8.3. File I/O Operations and Replica Management

Of course, the whole point of a filesystem is to store data infiles. To understand how HDFS does this, we must look at how readingand writing works, and how blocks are managed.

8.3.1. File Read and Write

An application adds data to HDFS by creating a new file and writingthe data to it. After the file is closed, the bytes written cannot bealtered or removed except that new data can be added to the file byreopening the file for append. HDFS implements a single-writer,multiple-reader model.

The HDFS client that opens a file for writing is granted a lease forthe file; no other client can write to the file. The writing clientperiodically renews the lease by sending a heartbeat to theNameNode. When the file is closed, the lease is revoked. The leaseduration is bound by a soft limit and a hard limit. Until the softlimit expires, the writer is certain of exclusive access to thefile. If the soft limit expires and the client fails to close the fileor renew the lease, another client can preempt the lease. If after thehard limit expires (one hour) and the client has failed to renew thelease, HDFS assumes that the client has quit and will automaticallyclose the file on behalf of the writer, and recover the lease. Thewriter's lease does not prevent other clients from reading the file; afile may have many concurrent readers.

An HDFS file consists of blocks. When there is a need for a new block,the NameNode allocates a block with a unique block ID and determines alist of DataNodes to host replicas of the block. The DataNodes form apipeline, the order of which minimizes the total network distance fromthe client to the last DataNode. Bytes are pushed to the pipeline as asequence of packets. The bytes that an application writes first bufferat the client side. After a packet buffer is filled (typically 64 KB),the data are pushed to the pipeline. The next packet can be pushed tothe pipeline before receiving the acknowledgment for the previouspackets. The number of outstanding packets is limited by theoutstanding packets window size of the client.

After data are written to an HDFS file, HDFS does not provide anyguarantee that data are visible to a new reader until the file isclosed. If a user application needs the visibility guarantee, it canexplicitly call the hflush operation. Then the current packet isimmediately pushed to the pipeline, and the hflush operation will waituntil all DataNodes in the pipeline acknowledge the successfultransmission of the packet. All data written before the hflushoperation are then certain to be visible to readers.

[Data Pipeline While Writing a Block]

Figure 8.2: Data Pipeline While Writing a Block

If no error occurs, block construction goes through three stages asshown in Figure 8.2 illustrating a pipeline of threeDataNodes (DN) and a block of five packets. In the picture, boldlines represent data packets, dashed lines represent acknowledgmentmessages, and thin lines represent control messages to setup and closethe pipeline. Vertical lines represent activity at the client and thethree DataNodes where time proceeds from top to bottom. Fromt0 to t1 is the pipeline setup stage. The intervalt1 to t2 is the data streaming stage, where t1 isthe time when the first data packet gets sent and t2 is thetime that the acknowledgment to the last packet gets received. Here anhflush operation transmits packet 2. The hflush indicationtravels with the packet data and is not a separate operation. Thefinal interval t2 to t3 is the pipeline close stage forthis block.

In a cluster of thousands of nodes, failures of a node (most commonlystorage faults) are daily occurrences. A replica stored on a DataNodemay become corrupted because of faults in memory, disk, or network.HDFS generates and stores checksums for each data block of an HDFSfile. Checksums are verified by the HDFS client while reading to helpdetect any corruption caused either by client, DataNodes, ornetwork. When a client creates an HDFS file, it computes the checksumsequence for each block and sends it to a DataNode along with thedata. A DataNode stores checksums in a metadata file separate from theblock's data file. When HDFS reads a file, each block's data andchecksums are shipped to the client. The client computes the checksumfor the received data and verifies that the newly computed checksumsmatches the checksums it received. If not, the client notifies theNameNode of the corrupt replica and then fetches a different replicaof the block from another DataNode.

When a client opens a file to read, it fetches the list of blocks andthe locations of each block replica from the NameNode. The locationsof each block are ordered by their distance from the reader. Whenreading the content of a block, the client tries the closest replicafirst. If the read attempt fails, the client tries the next replica insequence. A read may fail if the target DataNode is unavailable, thenode no longer hosts a replica of the block, or the replica is foundto be corrupt when checksums are tested.

HDFS permits a client to read a file that is open for writing. Whenreading a file open for writing, the length of the last block stillbeing written is unknown to the NameNode. In this case, the clientasks one of the replicas for the latest length before starting to readits content.

The design of HDFS I/O is particularly optimized for batch processingsystems, like MapReduce, which require high throughput for sequentialreads and writes. Ongoing efforts will improve read/write responsetime for applications that require real-time data streaming or randomaccess.

8.3.2. Block Placement

For a large cluster, it may not be practical to connect all nodes in aflat topology. A common practice is to spread the nodes acrossmultiple racks. Nodes of a rack share a switch, and rack switches areconnected by one or more core switches. Communication between twonodes in different racks has to go through multiple switches. In mostcases, network bandwidth between nodes in the same rack is greaterthan network bandwidth between nodes in different racks.Figure 8.3 describes a cluster with two racks, each ofwhich contains three nodes.

[Cluster Topology]

Figure 8.3: Cluster Topology

HDFS estimates the network bandwidth between two nodes by theirdistance. The distance from a node to its parent node is assumed to beone. A distance between two nodes can be calculated by summing thedistances to their closest common ancestor. A shorter distancebetween two nodes means greater bandwidth they can use to transferdata.

HDFS allows an administrator to configure a script that returns anode's rack identification given a node's address. The NameNode is thecentral place that resolves the rack location of each DataNode. When aDataNode registers with the NameNode, the NameNode runs the configuredscript to decide which rack the node belongs to. If no such a scriptis configured, the NameNode assumes that all the nodes belong to adefault single rack.

The placement of replicas is critical to HDFS data reliability andread/write performance. A good replica placement policy should improvedata reliability, availability, and network bandwidthutilization. Currently HDFS provides a configurable block placementpolicy interface so that the users and researchers can experiment andtest alternate policies that are optimal for their applications.

The default HDFS block placement policy provides a tradeoff betweenminimizing the write cost, and maximizing data reliability,availability and aggregate read bandwidth. When a new block iscreated, HDFS places the first replica on the node where the writer islocated. The second and the third replicas are placed on two differentnodes in a different rack. The rest are placed on random nodes withrestrictions that no more than one replica is placed at any one nodeand no more than two replicas are placed in the same rack, ifpossible. The choice to place the second and third replicas on adifferent rack better distributes the block replicas for a single fileacross the cluster. If the first two replicas were placed on the samerack, for any file, two-thirds of its block replicas would be on thesame rack.

After all target nodes are selected, nodes are organized as a pipelinein the order of their proximity to the first replica. Data are pushedto nodes in this order. For reading, the NameNode first checks if theclient's host is located in the cluster. If yes, block locations arereturned to the client in the order of its closeness to thereader. The block is read from DataNodes in this preference order.

This policy reduces the inter-rack and inter-node write traffic andgenerally improves write performance. Because the chance of a rackfailure is far less than that of a node failure, this policy does notimpact data reliability and availability guarantees. In the usual caseof three replicas, it can reduce the aggregate network bandwidth usedwhen reading data since a block is placed in only two unique racksrather than three.

8.3.3. Replication Management

The NameNode endeavors to ensure that each block always has theintended number of replicas. The NameNode detects that a block hasbecome under- or over-replicated when a block report from a DataNodearrives. When a block becomes over replicated, the NameNode chooses areplica to remove. The NameNode will prefer not to reduce the numberof racks that host replicas, and secondly prefer to remove a replicafrom the DataNode with the least amount of available disk space. Thegoal is to balance storage utilization across DataNodes withoutreducing the block's availability.

When a block becomes under-replicated, it is put in the replicationpriority queue. A block with only one replica has the highestpriority, while a block with a number of replicas that is greater thantwo thirds of its replication factor has the lowest priority. Abackground thread periodically scans the head of the replication queueto decide where to place new replicas. Block replication follows asimilar policy as that of new block placement. If the number ofexisting replicas is one, HDFS places the next replica on a differentrack. In case that the block has two existing replicas, if the twoexisting replicas are on the same rack, the third replica is placed ona different rack; otherwise, the third replica is placed on adifferent node in the same rack as an existing replica. Here the goalis to reduce the cost of creating new replicas.

The NameNode also makes sure that not all replicas of a block arelocated on one rack. If the NameNode detects that a block's replicasend up at one rack, the NameNode treats the block as mis-replicatedand replicates the block to a different rack using the same blockplacement policy described above. After the NameNode receives thenotification that the replica is created, the block becomesover-replicated. The NameNode then will decides to remove an oldreplica because the over-replication policy prefers not to reduce thenumber of racks.

8.3.4. Balancer

HDFS block placement strategy does not take into account DataNode diskspace utilization. This is to avoid placing new—more likely to bereferenced—data at a small subset of the DataNodes with a lot offree storage. Therefore data might not always be placed uniformlyacross DataNodes. Imbalance also occurs when new nodes are added tothe cluster.

The balancer is a tool that balances disk space usage on an HDFScluster. It takes a threshold value as an input parameter, which is afraction between 0 and 1. A cluster is balanced if, for each DataNode,the utilization of the node3 differs from theutilization of the whole cluster4 by no morethan the threshold value.

The tool is deployed as an application program that can be run by thecluster administrator. It iteratively moves replicas from DataNodeswith higher utilization to DataNodes with lower utilization. One keyrequirement for the balancer is to maintain data availability. Whenchoosing a replica to move and deciding its destination, the balancerguarantees that the decision does not reduce either the number ofreplicas or the number of racks.

The balancer optimizes the balancing process by minimizing theinter-rack data copying. If the balancer decides that a replica Aneeds to be moved to a different rack and the destination rack happensto have a replica B of the same block, the data will be copied fromreplica B instead of replica A.

A configuration parameter limits the bandwidth consumed by rebalancingoperations. The higher the allowed bandwidth, the faster a cluster canreach the balanced state, but with greater competition withapplication processes.

8.3.5. Block Scanner

Each DataNode runs a block scanner that periodically scans its blockreplicas and verifies that stored checksums match the block data. Ineach scan period, the block scanner adjusts the read bandwidth inorder to complete the verification in a configurable period. If aclient reads a complete block and checksum verification succeeds, itinforms the DataNode. The DataNode treats it as a verification of thereplica.

The verification time of each block is stored in a human-readable logfile. At any time there are up to two files in the top-level DataNodedirectory, the current and previous logs. New verification times areappended to the current file. Correspondingly, each DataNode has anin-memory scanning list ordered by the replica's verification time.

Whenever a read client or a block scanner detects a corrupt block, itnotifies the NameNode. The NameNode marks the replica as corrupt, butdoes not schedule deletion of the replica immediately. Instead, itstarts to replicate a good copy of the block. Only when the goodreplica count reaches the replication factor of the block the corruptreplica is scheduled to be removed. This policy aims to preserve dataas long as possible. So even if all replicas of a block are corrupt,the policy allows the user to retrieve its data from the corruptreplicas.

8.3.6. Decommissioning

The cluster administrator specifies list of nodes to bedecommissioned. Once a DataNode is marked for decommissioning, itwill not be selected as the target of replica placement, but it willcontinue to serve read requests. The NameNode starts to schedulereplication of its blocks to other DataNodes. Once the NameNodedetects that all blocks on the decommissioning DataNode arereplicated, the node enters the decommissioned state. Then it can besafely removed from the cluster without jeopardizing any dataavailability.

8.3.7. Inter-Cluster Data Copy

When working with large datasets, copying data into and out of a HDFScluster is daunting. HDFS provides a tool called DistCp for largeinter/intra-cluster parallel copying. It is a MapReduce job; each ofthe map tasks copies a portion of the source data into the destinationfilesystem. The MapReduce framework automatically handles paralleltask scheduling, error detection and recovery.

8.4. Practice at Yahoo!

Large HDFS clusters at Yahoo! include about 4000 nodes. A typicalcluster node has two quad core Xeon processors running at 2.5 GHz,4–12 directly attached SATA drives (holding two terabytes each), 24 Gbyte ofRAM, and a 1-gigabit Ethernet connection. Seventy percent of the diskspace is allocated to HDFS. The remainder is reserved for theoperating system (Red Hat Linux), logs, and space to spill the outputof map tasks (MapReduce intermediate data are not stored in HDFS).

Forty nodes in a single rack share an IP switch. The rack switches areconnected to each of eight core switches. The core switches provideconnectivity between racks and to out-of-cluster resources. For eachcluster, the NameNode and the BackupNode hosts are speciallyprovisioned with up to 64 GB RAM; application tasks are never assignedto those hosts. In total, a cluster of 4000 nodes has 11 PB(petabytes; 1000 terabytes) of storage available as blocks that arereplicated three times yielding a net 3.7 PB of storage for userapplications. Over the years that HDFS has been in use, the hostsselected as cluster nodes have benefited from improvedtechnologies. New cluster nodes always have faster processors, biggerdisks and larger RAM. Slower, smaller nodes are retired or relegatedto clusters reserved for development and testing of Hadoop.

On an example large cluster (4000 nodes), there are about 65 millionfiles and 80 million blocks. As each block typically is replicatedthree times, every data node hosts 60 000 block replicas. Each day,user applications will create two million new files on thecluster. The 40 000 nodes in Hadoop clusters at Yahoo! provide 40 PBof on-line data storage.

Becoming a key component of Yahoo!'s technology suite meant tacklingtechnical problems that are the difference between being a researchproject and being the custodian of many petabytes of corporate data.Foremost are issues of robustness and durability of data. But alsoimportant are economical performance, provisions for resource sharingamong members of the user community, and ease of administration by thesystem operators.

8.4.1. Durability of Data

Replication of data three times is a robust guard against loss of datadue to uncorrelated node failures. It is unlikely Yahoo! has ever losta block in this way; for a large cluster, the probability of losing ablock during one year is less than 0.005. The key understanding isthat about 0.8 percent of nodes fail each month. (Even if the node iseventually recovered, no effort is taken to recover data it may havehosted.) So for the sample large cluster as described above, a node ortwo is lost each day. That same cluster will re-create the 60 000block replicas hosted on a failed node in about twominutes: re-replication is fast because it is a parallel problem thatscales with the size of the cluster. The probability of several nodesfailing within two minutes such that all replicas of some block arelost is indeed small.

Correlated failure of nodes is a different threat. The most commonlyobserved fault in this regard is the failure of a rack or core switch.HDFS can tolerate losing a rack switch (each block has a replica onsome other rack). Some failures of a core switch can effectivelydisconnect a slice of the cluster from multiple racks, in which caseit is probable that some blocks will become unavailable. In eithercase, repairing the switch restores unavailable replicas to thecluster. Another kind of correlated failure is the accidental ordeliberate loss of electrical power to the cluster. If the loss ofpower spans racks, it is likely that some blocks will becomeunavailable. But restoring power may not be a remedy because one-halfto one percent of the nodes will not survive a full power-on restart.Statistically, and in practice, a large cluster will lose a handful ofblocks during a power-on restart.

In addition to total failures of nodes, stored data can be corruptedor lost. The block scanner scans all blocks in a large cluster eachfortnight and finds about 20 bad replicas in the process. Bad replicasare replaced as they are discovered.

8.4.2. Features for Sharing HDFS

As the use of HDFS has grown, the filesystem itself has had tointroduce means to share the resource among a large number of diverse users.The first such feature was a permissions framework closelymodeled on the Unix permissions scheme for file and directories. Inthis framework, files and directories have separate access permissionsfor the owner, for other members of the user group associated with thefile or directory, and for all other users. The principle differencesbetween Unix (POSIX) and HDFS are that ordinary files in HDFS haveneither execute permissions nor sticky bits.

In the earlier version of HDFS, user identity was weak: you were whoyour host said you are. When accessing HDFS, the application clientsimply queries the local operating system for user identity and groupmembership. In the new framework, the application client must presentto the name system credentials obtained from a trustedsource. Different credential administrations are possible; the initialimplementation uses Kerberos. The user application can use the sameframework to confirm that the name system also has a trustworthyidentity. And the name system also can demand credentials from each ofthe data nodes participating in the cluster.

The total space available for data storage is set by the number ofdata nodes and the storage provisioned for each node. Early experiencewith HDFS demonstrated a need for some means to enforce the resourceallocation policy across user communities. Not only must fairness ofsharing be enforced, but when a user application might involvethousands of hosts writing data, protection against applicationsinadvertently exhausting resources is also important. For HDFS,because the system metadata are always in RAM, the size of thenamespace (number of files and directories) is also a finiteresource. To manage storage and namespace resources, each directorymay be assigned a quota for the total space occupied by files in thesub-tree of the namespace beginning at that directory. A separatequota may also be set for the total number of files and directories inthe sub-tree.

While the architecture of HDFS presumes most applications will streamlarge data sets as input, the MapReduce programming framework can havea tendency to generate many small output files (one from each reducetask) further stressing the namespace resource. As a convenience, adirectory sub-tree can be collapsed into a single Hadoop Archivefile. A HAR file is similar to a familiar tar, JAR, or Zip file, butfilesystem operations can address the individual files within thearchive, and a HAR file can be used transparently as the input to aMapReduce job.

8.4.3. Scaling and HDFS Federation

Scalability of the NameNode has been a key struggle[Shv10].Because the NameNode keeps all thenamespace and block locations in memory, the size of the NameNode heaplimits the number of files and also the number of blocksaddressable. This also limits the total cluster storage that can besupported by the NameNode. Users are encouraged to create largerfiles, but this has not happened since it would require changes inapplication behavior. Furthermore, we are seeing new classes ofapplications for HDFS that need to store a large number of smallfiles. Quotas were added to manage the usage, and an archive tool hasbeen provided, but these do not fundamentally address thescalability problem.

A new feature allows multiple independent namespaces (and NameNodes)to share the physical storage within a cluster. Namespaces use blocksgrouped under a Block Pool. Block pools are analogous to logical units (LUNs) in a SANstorage system and a namespace with its pool of blocks is analogous toa filesystem volume.

This approach offers a number of advantages besides scalability: itcan isolate namespaces of different applications improving the overallavailability of the cluster. Block pool abstraction allows otherservices to use the block storage with perhaps a different namespacestructure. We plan to explore other approaches to scaling such asstoring only partial namespace in memory, and truly distributedimplementation of the NameNode.

Applications prefer to continue using a single namespace. Namespacescan be mounted to create such a unified view. A client-side mounttable provide an efficient way to do that, compared to a server-sidemount table: it avoids an RPC to the central mount table and is alsotolerant of its failure. The simplest approach is to have sharedcluster-wide namespace; this can be achieved by giving the sameclient-side mount table to each client of the cluster. Client-sidemount tables also allow applications to create a private namespaceview. This is analogous to the per-process namespaces that are used todeal with remote execution in distributed systems[PPT+93, Rad94,RP93].

8.5. Lessons Learned

A very small team was able to build the Hadoop filesystem and make itstable and robust enough to use it in production. A large part of thesuccess was due to the very simple architecture: replicated blocks,periodic block reports and central metadata server. Avoiding the fullPOSIX semantics also helped. Although keeping the entire metadata inmemory limited the scalability of the namespace, it made the NameNodevery simple: it avoids the complex locking of typical filesystems. Theother reason for Hadoop's success was to quickly use the system forproduction at Yahoo!, as it was rapidly and incrementallyimproved. The filesystem is very robust and the NameNode rarely fails;indeed most of the down time is due to software upgrades. Onlyrecently have failover solutions (albeit manual) emerged

Many have been surprised by the choice of Java in building a scalablefilesystem. While Java posed challenges for scaling the NameNode dueto its object memory overhead and garbage collection, Java has beenresponsible to the robustness of the system; it has avoidedcorruption due to pointer or memory management bugs.

8.6. Acknowledgment

We thank Yahoo! for investing in Hadoop and continuing tomake it available as open source; 80% of the HDFS and MapReduce codewas developed at Yahoo! We thank all Hadoop committers andcollaborators for their valuable contributions.


  3. Defined as the ratio of usedspace at the node to total capacity of the node.
  4. Defined as the ratio of usedspace in the cluster to total capacity of the cluster.