Month: May 2013

Parsing HTML pages with Jsoup

Introduction

Recently I wanted to retrieve content from an HTML web page. A few suggestions on my project also led me to look into this area. To keep the blog updated as well, I decided to write about the most interesting solution I found.

Jsoup

Of the options I found, Jsoup has powerful capabilities for extracting data from HTML pages. You can use CSS-style selectors, optionally combined with regular expressions on attribute values, to pick out elements of an HTML page. The example in this post shows how to get the text of the ‘body’ section of a web page and how to download the images in an HTML page.
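
To make the selector idea concrete, here is a minimal sketch that parses a small made-up HTML string (not a real page) and filters the img elements with the same kind of selector used in this post. The class name SelectorSketch, the method imageSources and the sample markup are assumptions for illustration only, and the Jsoup JAR has to be on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorSketch {

    // Hypothetical sample markup, used only for illustration
    static final String SAMPLE_HTML =
            "<html><body>"
            + "<p class=\"intro\">Hello</p>"
            + "<img src=\"/images/logo.PNG\">"
            + "<img src=\"photo.jpeg\">"
            + "<img src=\"clip.svg\">"
            + "</body></html>";

    /**
     * Returns the src values of the <img> elements whose src attribute
     * matches the regular expression given in the selector.
     */
    static List<String> imageSources(String html) {
        Document doc = Jsoup.parse(html);

        // [attr~=regex] keeps elements whose attribute value matches the regex;
        // (?i) makes the match case-insensitive
        Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");

        List<String> sources = new ArrayList<>();
        for (Element image : images) {
            sources.add(image.attr("src"));
        }
        return sources;
    }

    public static void main(String[] args) {
        // The .svg image is filtered out by the selector
        System.out.println(imageSources(SAMPLE_HTML));

        // A plain CSS-style selector: <p> elements with class "intro"
        System.out.println(Jsoup.parse(SAMPLE_HTML).select("p.intro").text());
    }
}
```

Parsing from a string like this is a handy way to try a selector before pointing it at a live page.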

How To Do It

If you are working in an IDE such as NetBeans, you first have to add the Jsoup JAR file to the project's Libraries.

After that you can start coding. As explained above, the following code gets the body text and the images of an HTML page. I have explained the code with comments as far as I can. 🙂


import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Jsoup HTML parser example.
 *
 * @author BUDDHIMA
 */
public class JSOUP {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            // URL of the web page to fetch. The timeout is given in milliseconds;
            // if it is too short, the request may fail on a slow connection.
            Document doc = Jsoup.connect("https://buddhimawijeweera.wordpress.com/2013/05/18/parsing-html-pages-with-jsoup/").timeout(300000).get();

            // Print the text contained in the <body> section
            System.out.println(doc.body().text());

            // Select <img> elements whose src ends in an image extension (case-insensitive regex)
            Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");

            // Image counter, used to build unique file names
            int i = 0;

            for (Element image : images) {
                // absUrl resolves a relative src against the page URL
                String src = image.absUrl("src");
                System.out.println("src: " + src);

                // Download and save the image; try-with-resources closes the stream even on failure.
                // Every file gets a .jpg extension for simplicity, whatever the real image type is.
                try (InputStream in = new URL(src).openStream()) {
                    Files.copy(in, Paths.get("borrowed_image-" + i + ".jpg"), StandardCopyOption.REPLACE_EXISTING);
                    i++;
                } catch (Exception e) {
                    System.out.println("Error in reading & storing images: " + e.getMessage());
                }
            }

        } catch (IOException ex) {
            System.out.println("Error: " + ex.getMessage());
        }
    }
}
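
One thing to note about the code above is that every image is saved with a .jpg extension, even when the source is a PNG or GIF. A minimal sketch of one way to derive the extension from the image URL is shown here. It uses only the JDK; the class name ImageFileNames, the method fileNameFor and the example.com URLs are hypothetical names introduced for illustration, and the fallback to "jpg" is an assumption rather than anything Jsoup prescribes.

```java
import java.net.URI;
import java.util.Locale;

public class ImageFileNames {

    /**
     * Builds a local file name such as "borrowed_image-0.png" from an image URL.
     * Falls back to "jpg" when the URL path has no recognised image extension.
     * Note: URI.create throws IllegalArgumentException for URLs that contain
     * illegal characters such as spaces.
     */
    static String fileNameFor(String imageUrl, int index) {
        // getPath() leaves out the ?query and #fragment parts of the URL
        String path = URI.create(imageUrl).getPath();
        String extension = "jpg";
        if (path != null) {
            int dot = path.lastIndexOf('.');
            if (dot >= 0 && dot < path.length() - 1) {
                String candidate = path.substring(dot + 1).toLowerCase(Locale.ROOT);
                // Accept only the extensions that the selector in the post looks for
                if (candidate.matches("png|jpe?g|gif")) {
                    extension = candidate;
                }
            }
        }
        return "borrowed_image-" + index + "." + extension;
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor("https://example.com/img/logo.PNG?w=300", 0));
        System.out.println(fileNameFor("https://example.com/avatar", 1));
    }
}
```

The returned name could then be used in place of the hard-coded "borrowed_image-" + i + ".jpg" when saving each image.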

Conclusion

Hopefully the example above gives you a basic idea of Jsoup, but there’s a lot more you can do with it. The references below will help you go further 🙂

References

[1] Jsoup official page: http://jsoup.org/

[2] Hello World Examples: http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/

[3] Download images (Stack Overflow Q&A): http://stackoverflow.com/questions/5882005/how-to-download-image-from-any-web-page-in-java
