This blog covers some of the new features that Hadoop v3 offers to existing and new Hadoop customers. It is a good idea to familiarize yourself with these features so that, in case you want to move to Hadoop or upgrade your cluster from an older version, you know what you can try and experiment with on your cluster!
I will cover installing and upgrading to Hadoop v3 in separate blogs, as this one focuses strictly on the features of Hadoop v3.
So, let’s have a look at the history of Hadoop version 3, which was released at the end of last year, on 13-December-2017. What a nice Christmas surprise for the community! All thanks to the hard-working committers for their dedication in making this happen.
As per the Apache Hadoop website, the timeline of the v3 releases looks like this:
And the progress chart of Hadoop v3 looks like this:
After four alpha releases and one beta release, 3.0.0 is generally available. 3.0.0 consists of 302 bug fixes, improvements, and other enhancements since 3.0.0-beta1. All together, 6242 issues were fixed as part of the 3.0.0 release series since 2.7.0.
If you are keen on more details about the JIRA issues reported and addressed, you can have a look at the link provided below:
The salient features of Hadoop v3:
Now that we have looked at the history, let me jot down the features introduced as part of this new release:
- Minimum required Java version increased from Java 7 to Java 8
- Support for erasure coding in HDFS
- YARN Timeline Service v.2
- Shell script rewrite
- Shaded client jars
- Support for Opportunistic Containers and Distributed Scheduling
- MapReduce task-level native optimization
- Support for more than 2 NameNodes
- Default ports of multiple services have been changed
- Support for Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors
- Intra-datanode balancer
- Reworked daemon and task heap management
- S3Guard: Consistency and Metadata Caching for the S3A filesystem client
- HDFS Router-Based Federation
- API-based configuration of Capacity Scheduler queue configuration
- YARN Resource Types
Now I will cover the details of the features on my favourite list, which should help readers understand them technically. Note: I can’t cover each feature in depth here, as that would make the blog clumsy and boring, which I don’t want at all.
- Hadoop Erasure Coding: Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication. Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature. To understand more about this feature, you can refer to the listed link:
- NameNode HA with more than 2 nodes: With this feature, a customer can run more than two NameNodes, with one active and the others on standby. Earlier releases supported NameNode HA only as an Active/Passive pair, which tolerates the failure of a single NameNode. To achieve a higher degree of fault tolerance, a customer can now configure HA with more than two NameNodes, together with the Quorum Journal Manager for fencing.
- Changes in default ports of multiple services: With this change, the default ports of Hadoop services such as the NameNode, Secondary NameNode, DataNode, and KMS have been moved out of the Linux ephemeral port range (32768-61000). In earlier versions, having these service ports in the ephemeral range sometimes caused conflicts with other applications and created problems at service startup.
- Intra-datanode balancer: You may remember the HDFS balancer command, used to rebalance the cluster when new DataNodes are added or for other admin-specific tasks. However, a single DataNode manages multiple disks, and adding or replacing disks can lead to significant skew within a DataNode. This situation was not handled by the HDFS balancer utility in earlier versions of Hadoop, which concerns itself with inter-, not intra-, DataNode skew. The new feature takes care of this by balancing data across the disks within a DataNode, and it is invoked through the hdfs diskbalancer CLI.
- HDFS Router-Based Federation: HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. This is similar to the existing ViewFs and HDFS Federation functionality, except the mount table is managed on the server side by the routing layer rather than on the client.
- YARN Timeline Service v.2: Timeline Service v.2 addresses two major challenges: improving the scalability and reliability of the Timeline Service, and enhancing usability by introducing flows and aggregation, which were lacking in the earlier version.
- YARN Resource Types: This feature enables user-defined countable resources, so a Hadoop cluster admin can define countable resources such as GPUs, software licenses, or locally attached storage. These come in addition to CPU and memory, which were already supported in earlier releases.
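To make the erasure coding numbers from the list above concrete, here is a small Python sketch of the space-overhead arithmetic. The function names are my own and purely illustrative (they are not part of any Hadoop API), and the 100 TB figure is a made-up example. The idea is simply that a Reed-Solomon (k, m) layout stores k data units plus m parity units, so every byte of user data costs (k + m) / k bytes of raw storage, while n-way replication costs n bytes.

```python
def ec_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data for a (data, parity) erasure code."""
    return (data_units + parity_units) / data_units


def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per byte of user data with plain replication."""
    return float(replicas)


def raw_storage_tb(user_tb: float, overhead: float) -> float:
    """Raw capacity needed to hold user_tb terabytes at the given overhead."""
    return user_tb * overhead


if __name__ == "__main__":
    user_tb = 100.0  # hypothetical amount of user data, for illustration only

    schemes = {
        "3x replication": replication_overhead(3),
        "Reed-Solomon (10,4)": ec_overhead(10, 4),
        "Reed-Solomon (6,3)": ec_overhead(6, 3),
    }
    for name, overhead in schemes.items():
        needed = raw_storage_tb(user_tb, overhead)
        print(f"{name}: {overhead:.1f}x overhead, {needed:.0f} TB raw for {user_tb:.0f} TB of data")
```

Note that the saving does not come at the cost of durability: a (10,4) layout can lose any 4 of its 14 units and still rebuild the data, whereas 3x replication survives the loss of 2 copies. The price is paid in the network and CPU overhead during reconstruction mentioned above.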
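For the default port change in the list above, the following Python sketch checks old and new defaults against the ephemeral range quoted in this blog (32768-61000). Treat the port table as an assumption: it lists the HDFS defaults as I recall them from the Hadoop 3.0.0 release notes, it is not exhaustive, and you should verify the values against the hdfs-default.xml of your own distribution. The helper names are hypothetical.

```python
# Ephemeral port range as quoted in this blog (inclusive on both ends).
EPHEMERAL_LOW, EPHEMERAL_HIGH = 32768, 61000

# service -> (Hadoop 2 default, Hadoop 3.0.0 default).
# Assumption: values recalled from the 3.0.0 release notes; verify before relying on them.
HDFS_PORT_CHANGES = {
    "NameNode HTTP": (50070, 9870),
    "NameNode HTTPS": (50470, 9871),
    "Secondary NameNode HTTP": (50090, 9868),
    "Secondary NameNode HTTPS": (50091, 9869),
    "DataNode data transfer": (50010, 9866),
    "DataNode IPC": (50020, 9867),
    "DataNode HTTP": (50075, 9864),
    "DataNode HTTPS": (50475, 9865),
}


def in_ephemeral_range(port: int) -> bool:
    """True if the port can be handed out by the OS as a client-side port."""
    return EPHEMERAL_LOW <= port <= EPHEMERAL_HIGH


def ports_at_risk(ports: dict) -> list:
    """Return the service names whose port falls inside the ephemeral range."""
    return [name for name, port in ports.items() if in_ephemeral_range(port)]


if __name__ == "__main__":
    old_defaults = {name: old for name, (old, _new) in HDFS_PORT_CHANGES.items()}
    new_defaults = {name: new for name, (_old, new) in HDFS_PORT_CHANGES.items()}

    print("Hadoop 2 defaults inside the ephemeral range:", ports_at_risk(old_defaults))
    print("Hadoop 3 defaults inside the ephemeral range:", ports_at_risk(new_defaults))
```

If you override any of these ports in your own configuration, the same check is a quick way to confirm that a custom value does not land back in the ephemeral range.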
Tools/Information used for writing this blog: