Parsing HTML pages with Jsoup

Introduction

Recently I wanted to retrieve content from an HTML web page, and a few suggestions on a project also pushed me to look into this area. To keep the blog updated, I decided to write about the most interesting solution I found.

Jsoup

Among the options I found, Jsoup has powerful capabilities for extracting data from HTML pages. You can use CSS-like selectors, which may include regular expressions, to pick out elements in an HTML page. The following example shows how to get the text of the ‘body’ section of a web page and how to download the images in an HTML page.
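To get a feel for the regular expression part, here is a minimal sketch that uses only java.util.regex (no Jsoup needed). It applies the same pattern that appears in the selector img[src~=(?i)\.(png|jpe?g|gif)] used in this post's example. The class and method names are made up for illustration, and it assumes the ~= operator accepts a value when the pattern is found anywhere inside it.

```java
import java.util.regex.Pattern;

class ImageSrcPattern {

    // Same regular expression as in the selector img[src~=(?i)\.(png|jpe?g|gif)]
    static final Pattern IMAGE_SRC = Pattern.compile("(?i)\\.(png|jpe?g|gif)");

    // Assumption: the value is accepted if the pattern occurs anywhere in it
    static boolean looksLikeImage(String src) {
        return IMAGE_SRC.matcher(src).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "photo.JPG",
            "/img/logo.png?w=300",
            "banner.jpeg",
            "icon.svg",
            "document.pdf"
        };
        // Print each sample together with whether the pattern accepts it
        for (String src : samples) {
            System.out.println(src + " -> " + looksLikeImage(src));
        }
    }
}
```

Because the pattern is not anchored to the end of the value, a src such as "my.gifts/page.html" would also be accepted. Adding an anchor to the pattern is one way to tighten it if that matters for your pages.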

How To Do It

If you are doing this in an IDE like NetBeans, you have to add the Jsoup JAR file to the project's Libraries.

After that you can start coding. As explained above, the following code gets the body text and the images of an HTML page. I have explained the code with comments as far as I can. 🙂


import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 *
 * JSoup HTML parser
 *
 * @author BUDDHIMA
 */
public class JSOUP {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {

            // URL of the web page to connect to.
            // The timeout is in milliseconds; with a shorter timeout the operation may fail.
            Document doc = Jsoup.connect("https://buddhimawijeweera.wordpress.com/2013/05/18/parsing-html-pages-with-jsoup/").timeout(300000).get();

            // Print the text contained in the <body> section
            System.out.println(doc.body().text());

            // Select <img> elements whose src contains .png, .jpg, .jpeg or .gif (case-insensitive)
            Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");

            // Image counter
            int i = 0;

            for (Element image : images) {
                // Resolve relative paths against the page URL
                String src = image.absUrl("src");
                System.out.println("src: " + src);

                // Read the image into memory; try-with-resources closes the stream
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                try (InputStream in = new BufferedInputStream(new URL(src).openStream())) {
                    byte[] buf = new byte[1024];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                    byte[] response = out.toByteArray();

                    // Save the image (note: always named .jpg, whatever the real image type is)
                    try (FileOutputStream fos = new FileOutputStream("borrowed_image-" + i + ".jpg")) {
                        fos.write(response);
                    }
                    i++;

                } catch (IOException e) {
                    System.out.println("Error in reading & storing images: " + e.getMessage());
                }
            }

        } catch (IOException ex) {
            System.out.println("Error: " + ex.getMessage());

        }
    }
}
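
The listing above always names the saved file with a .jpg extension and buffers each image in memory before writing it. As a possible variation, the sketch below guesses the extension from the image URL and copies the stream straight to disk with java.nio.file.Files.copy. The class ImageSaver and its methods extensionOf and save are hypothetical helpers written for this illustration, and the example.com URL and the in-memory bytes stand in for a real download.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

class ImageSaver {

    // Hypothetical helper: guess a file extension from the image URL, defaulting to "jpg"
    static String extensionOf(String src) {
        String path = src;
        // Drop any query string or fragment before looking for the extension
        int cut = path.indexOf('?');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        cut = path.indexOf('#');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        // Only accept a dot that belongs to the last path segment and is not the final character
        if (dot > slash && dot < path.length() - 1) {
            return path.substring(dot + 1).toLowerCase(Locale.ROOT);
        }
        return "jpg";
    }

    // Copies the stream to the target file and returns the number of bytes written
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in the real program the stream would come from url.openStream()
        byte[] fakeImage = {1, 2, 3, 4};
        String src = "https://example.com/pics/logo.PNG?w=300";

        Path target = Files.createTempFile("borrowed_image-0", "." + extensionOf(src));
        try (InputStream in = new ByteArrayInputStream(fakeImage)) {
            long written = save(in, target);
            System.out.println("Saved " + written + " bytes to " + target);
        }
        Files.delete(target);
    }
}
```

Guessing from the URL is only a heuristic, since a server can serve any image type under any name. It is still usually closer than naming everything .jpg.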

Conclusion

Hopefully the above example gives you a basic idea of Jsoup, but there’s a lot more you can do with it. The references below will help you go further 🙂

References

[1] Jsoup official page: http://jsoup.org/

[2] Hello World Examples: http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/

[3] Download images (Stackoverflow Q & A): http://stackoverflow.com/questions/5882005/how-to-download-image-from-any-web-page-in-java

6 thoughts on “Parsing HTML pages with Jsoup”

    1. Hi Jefrey,
      You can use the “response” byte array to store the image in a database. For that you need to add a “blob” type field to your database table.
      More on SQLite can be found in [1], and a suggested answer is described in [2].
      [1] http://www.tutorialspoint.com/sqlite/sqlite_java.htm
      [2] http://stackoverflow.com/questions/17672829/write-byte-array-to-mysql-database-from-java-as-image-or-file

      But I should say that saving an image in a database is not the standard method. The standard procedure is to store the image in the file system and store the path to the image in the database.

      To retrieve it, you can do the reverse (retrieve the “blob” and convert it to a byte array [3]) and use BufferedImage to display it (more in [4] and [5]).

      [3] http://stackoverflow.com/questions/6662432/easiest-way-to-convert-a-blob-into-a-byte-array
      [4] http://www.mkyong.com/java/how-to-convert-byte-to-bufferedimage-in-java/
      [5] http://stackoverflow.com/questions/299495/java-swing-how-to-add-an-image-to-a-jpanel
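
The byte array to BufferedImage step can be sketched roughly as follows, using only javax.imageio from the JDK. The class name BlobToImage and its two methods are invented for this illustration, and a small blank image stands in for bytes read back from a “blob” column, so no database is involved here.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

class BlobToImage {

    // Decodes image bytes (for example, bytes read back from a blob column).
    // ImageIO.read returns null when no registered reader recognizes the format.
    static BufferedImage toImage(byte[] bytes) throws IOException {
        return ImageIO.read(new ByteArrayInputStream(bytes));
    }

    // Encodes an image as PNG bytes, the kind of array that could be stored
    static byte[] toPngBytes(BufferedImage image) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(image, "png", out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a downloaded image: a blank 4x3 picture
        BufferedImage original = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);

        byte[] stored = toPngBytes(original);      // what would go into the database
        BufferedImage restored = toImage(stored);  // what comes back out

        System.out.println(restored.getWidth() + "x" + restored.getHeight());
    }
}
```

Displaying the restored image in a Swing panel is covered by [5].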

      Cheers!

  1. I tried to extract an image from other websites with the following code and I had no problems, but then I tried with another website and nothing happened. No image came up.

    protected Void doInBackground(Void... params) {

        try {
            // Connect to the web site
            Document document = Jsoup.connect("https://www.indiegogo.com/project/spy-cam-peek-i/embedded").get();
            // Using Elements to get the class data
            Elements img = document.select("div.i-project-card i-embedded img[src]");

            // Locate the src attribute
            String imgSrc = img.attr("src");
            // Download image from URL
            InputStream input = new java.net.URL(imgSrc).openStream();
            // Decode Bitmap
            bitmap = BitmapFactory.decodeStream(input);

        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
