Blocks are physical divisions of the data, while input splits are logical divisions. One input split can map to multiple physical blocks.
When Hadoop submits a job, it splits the input data logically, and each split is processed by one Mapper task.
The number of Mappers is equal to the number of splits.
One important thing to remember is that InputSplit doesn’t contain actual data but a reference (storage locations) to the data.
A split basically has two things: a length in bytes and a set of storage locations, which are just hostname strings.
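Those two pieces can be modeled in a few lines of plain Java. This is a hypothetical class for illustration only, not Hadoop's real implementation (which lives in org.apache.hadoop.mapreduce.lib.input.FileSplit):

```java
// Minimal sketch of what an input split carries: a byte range into a file
// plus the hostnames of the nodes holding the underlying block replicas.
// Note that it references the data; it does not hold the data itself.
public class SimpleSplit {
    private final long start;     // byte offset into the file
    private final long length;    // length of the split in bytes
    private final String[] hosts; // hostnames where the data lives

    public SimpleSplit(long start, long length, String[] hosts) {
        this.start = start;
        this.length = length;
        this.hosts = hosts;
    }

    public long getStart()     { return start; }
    public long getLength()    { return length; }
    public String[] getHosts() { return hosts; }
}
```

The framework uses the hostnames to schedule each Mapper close to its data (data locality), which is why a split stores locations rather than bytes.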
Block size and split size are customizable. The default block size is 64 MB (128 MB in Hadoop 2.x), and the default split size is equal to the block size.
1 data set = 1….n files = 1….n blocks for each file
1 mapper = 1 input split = 1….n blocks
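Under the default setup (split size equal to block size), the number of mappers is just the file size divided by the split size, rounded up. A quick sketch of that arithmetic, using a hypothetical helper and ignoring Hadoop's small-tail merging of the last split:

```java
// Hypothetical helper: number of splits (and hence mappers) for one file,
// assuming split size == block size and no merging of a small final chunk.
public class SplitCount {
    public static long numSplits(long fileSizeBytes, long splitSizeBytes) {
        // Ceiling division: a partial final chunk still needs its own split.
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }
}
```

For example, a 200 MB file with a 64 MB split size yields ceil(200/64) = 4 splits, and therefore 4 mappers.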
InputFormat.getSplits() is responsible for generating the input splits, each of which is used as input for one mapper.
By default, one input split is created for each HDFS block.
As users, we don't need to deal with InputSplit directly: the InputFormat is responsible for creating the splits and dividing them into records.
FileInputFormat, by default, breaks a file into 128 MB chunks (the same size as HDFS blocks in Hadoop 2.x). We can control this value by setting the mapred.min.split.size parameter in mapred-site.xml, or by overriding the parameter in the Job object used to submit a particular MapReduce job. We can also control how the file is broken up into splits by writing a custom InputFormat.
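The split size that FileInputFormat derives from these settings follows the formula max(minimumSize, min(maximumSize, blockSize)). A standalone sketch of that calculation (the constant names and defaults here are illustrative, not Hadoop's actual configuration keys):

```java
// Sketch of FileInputFormat's split-size rule:
//   splitSize = max(minSize, min(maxSize, blockSize))
// Raising the minimum split size above the block size forces splits larger
// than a block; lowering the maximum below the block size forces smaller ones.
public class SplitSize {
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }
}
```

With a 128 MB block, a minimum of 1 byte, and an effectively unlimited maximum, the split size comes out equal to the block size; raising the minimum to 256 MB would make each split span two blocks.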