In this post, Hadoop development professionals share a guide to detecting unprocessed data files in HDFS using Python, Pig, and HBase. Read on to learn how to detect such files.
In real big data applications, data is typically processed hourly or daily. We therefore need a way to detect which data files have already been processed; this shortens processing time and prevents duplicates in the processed data set.
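The core detection idea can be sketched in a few lines of Python. The function and parameter names here are illustrative, not from the post's actual script:

```python
def select_new_files(files, last_processed):
    """Pick only the files modified after the last successful run.

    files: iterable of (path, modification_time) pairs, e.g. from an
    HDFS directory listing; last_processed: timestamp stored by the
    previous run (in HBase, in this post's setup).
    """
    return sorted(path for path, mtime in files if mtime > last_processed)
```

Everything else in the tutorial is plumbing around this comparison: listing the files, storing the timestamp, and feeding the selected files to Pig.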
Java: JDK 1.7
Cloudera version: CDH5.4.7 (see http://www.cloudera.com/downloads/cdh/5-4-7.html)
1. We need to prepare some input data files. Use the vi tool to create them locally:
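The contents of the sample files appeared only as a screenshot in the original post, so the data below is a hypothetical placeholder. This snippet writes a pair of small comma-separated files so the later steps have something to work with:

```python
# Placeholder contents -- the original post's sample data was not preserved.
samples = {
    "file1": "1,apple\n2,banana\n",
    "file2": "3,cherry\n4,durian\n",
}
for name, body in samples.items():
    with open(name, "w") as fh:
        fh.write(body)
```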
2. Put the local files into the Hadoop Distributed File System (HDFS) with these commands:
hadoop fs -mkdir -p /data/mydata/sample
hadoop fs -put file1 /data/mydata/sample/
hadoop fs -put file2 /data/mydata/sample/
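If you prefer to script the upload, the same two commands can be driven from Python. This is a sketch that assumes the `hadoop` client is installed and on your PATH:

```python
import subprocess

def hdfs_put_commands(local_path, hdfs_dir="/data/mydata/sample/"):
    """Build the commands from the post: create the directory, then upload."""
    return [
        ["hadoop", "fs", "-mkdir", "-p", hdfs_dir],
        ["hadoop", "fs", "-put", local_path, hdfs_dir],
    ]

def hdfs_put(local_path, hdfs_dir="/data/mydata/sample/"):
    # Requires a configured Hadoop client; check=True raises on failure.
    for cmd in hdfs_put_commands(local_path, hdfs_dir):
        subprocess.run(cmd, check=True)
```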
3. Create the HBase table to store the last processed time.
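The post's HBase table definition was shown as a screenshot, so the table and column names below are assumptions. Conceptually the table holds one cell per data path: the row key is the path, and the column stores the last-processed timestamp. An in-memory stand-in makes the logic easy to follow (real code would go through an HBase client such as happybase or the HBase shell):

```python
class LastProcessedStore:
    """Stand-in for an HBase table such as 'lastProcessTime' with one
    column family -- the names are assumptions, not from the post."""

    def __init__(self):
        self._rows = {}  # row key (data path) -> timestamp

    def get(self, path, default=0.0):
        # A path never seen before falls back to 0.0,
        # so on the first run every file counts as new.
        return self._rows.get(path, default)

    def put(self, path, timestamp):
        self._rows[path] = timestamp
```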
Code walkthrough
This Pig script loads the data and simply dumps it to the command line, to prove that only new files are processed.
Note: this Pig script compiles to a MapReduce job, which stores the data to HBase in parallel.
2. On the first run, after we put in the two sample files, you can see that Pig picks up only those two files, processes them, and dumps the output data.
3. You can verify the latest processed time in HBase with this command:
4. Once the data has been processed, if we run the code again without adding any new file, we get the message shown in the picture below and the script stops:
5. Now we verify that a newly arriving file, "file3", is detected: create a new file and put it into HDFS using the same steps as in the "Initial Steps" section.
6. Run the script again from step 1; you will see that only file3 is processed by the Pig script.
7. The structure of the project should look like this; the two files shown in bold in the picture are required:
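Putting steps 1–6 together, the whole loop can be sketched in Python. The HDFS listing is stubbed with a plain dict of path → modification time, and the HBase timestamp store with another dict; all names here are illustrative:

```python
def run_once(listing, store, data_path="/data/mydata/sample"):
    """One pass of the job: select files newer than the stored timestamp,
    hand them off for processing (represented here by returning their
    names), then advance the marker so the next run skips them."""
    last = store.get(data_path, 0.0)
    new_files = sorted(f for f, mtime in listing.items() if mtime > last)
    if not new_files:
        return []  # nothing new -- the real script stops here
    store[data_path] = max(listing[f] for f in new_files)
    return new_files
```

Running it twice over the same listing returns the new files once and then an empty list; adding a third file and running again returns only that file, which is exactly the behavior steps 2–6 demonstrate.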
Hadoop development experts hope this guide helps you detect unprocessed files in HDFS with Python, HBase, and Pig, so that no duplicate data is produced when processing raw data. To understand the process better, you can also read up on Pig, HBase, and Python. If you have any questions, ask them and clear your doubts.
Technoligent is a prominent name in the IT industry, with a stable team of resources that has worked on various technologies for decades. Our full-fledged team delivers expert solutions for top IT companies in India and the USA, providing quality IT outsourcing services across a range of IT domains.