Blocks are a physical division of the data, while input splits are a logical division. One input split can map to multiple physical blocks. When Hadoop submits a job, it splits the input data logically, and each split is processed by one Mapper task; the number of Mappers equals the number of splits. One important thing to remember is that an InputSplit doesn't contain the actual data, only a reference (the storage locations) to the data.
A split basically holds two things:
a length in bytes and a set of storage locations, which are just hostname strings.
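As a sketch, the real `org.apache.hadoop.mapreduce.InputSplit` exposes exactly these two pieces of information through `getLength()` and `getLocations()`. The minimal stand-in class below is illustrative only (it is not the Hadoop class, just a mirror of its shape):

```java
// Illustrative stand-in mirroring the shape of
// org.apache.hadoop.mapreduce.InputSplit (NOT the Hadoop class itself).
class SimpleSplit {
    private final long length;          // split length in bytes
    private final String[] locations;   // hostnames holding the underlying blocks

    SimpleSplit(long length, String[] locations) {
        this.length = length;
        this.locations = locations;
    }

    long getLength() { return length; }
    String[] getLocations() { return locations; }
}

public class SplitDemo {
    public static void main(String[] args) {
        // A hypothetical 128 MB split whose blocks live on two nodes.
        SimpleSplit split = new SimpleSplit(
                134217728L, new String[] {"node1.example.com", "node2.example.com"});
        System.out.println(split.getLength());            // prints 134217728
        System.out.println(split.getLocations().length);  // prints 2
    }
}
```

Note that the locations are only hints for scheduling Mappers close to the data; the split itself carries no file bytes.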
Both the block size and the split size are customizable. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later), and the default split size is equal to the block size.
1 data set = 1…n files; 1 file = 1…n blocks
1 mapper = 1 input split = 1….n blocks
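The effective split size comes from clamping the block size between the configured minimum and maximum split sizes; the helper below mirrors the arithmetic of Hadoop's `FileInputFormat.computeSplitSize()` (the byte values and scenarios are illustrative assumptions):

```java
public class SplitSizeDemo {
    // Mirrors the logic of FileInputFormat.computeSplitSize(): the effective
    // split size is the block size clamped between min and max split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB block (Hadoop 2.x default)

        // With default min (1 byte) and max (Long.MAX_VALUE), the split size
        // equals the block size, so 1 split = 1 block.
        System.out.println(
                computeSplitSize(blockSize, 1L, Long.MAX_VALUE) == blockSize); // prints true

        // Raising the minimum above the block size yields larger,
        // multi-block splits (and therefore fewer Mappers).
        long bigger = computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE);
        System.out.println(bigger == 256L * 1024 * 1024); // prints true
    }
}
```

In Hadoop 2.x these bounds are set via the `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize` properties.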
InputFormat.getSplits() is responsible for generating the input splits; each split then serves as the input for one Mapper.
By default, the FileInputFormat implementation creates one input split for each HDFS block.
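A simplified sketch of that default behavior, assuming defaults where the split size equals the block size (the real getSplits() also records byte offsets and host locations, and allows the last split to be slightly larger than a block):

```java
import java.util.ArrayList;
import java.util.List;

public class GetSplitsDemo {
    // Simplified sketch of FileInputFormat.getSplits()'s default behavior:
    // walk the file in splitSize-sized chunks, emitting one split per chunk.
    // Returns only the split lengths in bytes, for illustration.
    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long remaining = fileLength;
        while (remaining > 0) {
            long len = Math.min(splitSize, remaining);
            lengths.add(len);
            remaining -= len;
        }
        return lengths;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;
        // A 300 MB file on 128 MB blocks yields 3 splits: 128 MB, 128 MB, 44 MB,
        // and therefore 3 Mapper tasks.
        List<Long> splits = splitLengths(300L * 1024 * 1024, blockSize);
        System.out.println(splits.size()); // prints 3
    }
}
```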