A Crowd-sourced Cookbook on Writing Great Android® Apps

Extracting Information from Unstructured Text Using Regular Expressions

Author: Ian Darwin


You want to get information from another source, but the source doesn't provide it as data, only as a viewable web page.


Use java.net to download the HTML page, and use Regular Expressions to extract the information from the page.


If you aren't already a big fan of regular expressions, well, you should be. And maybe this recipe will help interest you in learning regex technology.

Suppose that I, as a published author, want to track how my book is selling in comparison to others. This information can be obtained for free just by clicking on the page for my book on any of the major bookseller sites, reading the sales rank number off the screen, and typing the number into a file, but that's too tedious. As I wrote in one of my earlier books, "computers get paid to extract relevant information from files; people should not have to do such mundane tasks." This program uses the Regular Expressions API and, in particular, newline matching to extract a value from an HTML page on the Amazon.com web site. It also reads from a URL object (see Using a RESTful Web Service). The pattern to look for is something like this (bear in mind that the HTML may change at any time, so I want to keep the pattern fairly general):

(bookstore name here) Sales Rank:
# 26,252
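
To make the target concrete, here is a minimal sketch of a regular expression that picks the number out of text shaped like the sample above, even when the label and the number sit on different lines. The class name SalesRankPatternDemo and the method extractRank() are hypothetical names used only for illustration, and the pattern is deliberately looser than the one in the actual program, which also matches the surrounding HTML tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original BookRank program.
class SalesRankPatternDemo {

    // "Sales Rank:", optional whitespace (\s also matches newlines),
    // a '#', optional whitespace, then digits and commas as group 1.
    static final Pattern RANK =
        Pattern.compile("Sales Rank:\\s*#\\s*([\\d,]+)");

    /** Returns the rank as an int, or -1 if the pattern is not found. */
    static int extractRank(String text) {
        Matcher m = RANK.matcher(text);
        if (!m.find()) {
            return -1;
        }
        // Strip the thousands separators before parsing.
        return Integer.parseInt(m.group(1).replace(",", ""));
    }

    public static void main(String[] args) {
        String sample = "Bookstore Sales Rank:\n# 26,252\n";
        System.out.println(extractRank(sample)); // prints 26252
    }
}
```

Because \s matches line terminators as well as spaces, the same expression works whether the rank follows the label on the same line or on the next one, provided the whole text is searched as a single string.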

As the pattern may extend over more than one line, I read the entire web page from the URL into a single long string using a private convenience routine, readerToString(), instead of the more traditional line-at-a-time paradigm. The value is extracted by the regular expression, converted to an integer, and returned. The longer version of this code in the Java Cookbook also plots a graph using an external program. The method that does the work is shown in this example.

// Part of class BookRank
public static int getBookRank(String isbn) throws IOException {
	// The RE pattern - digits and commas allowed
	final String pattern = "Rank:</b> #([\\d,]+)";
	final Pattern r = Pattern.compile(pattern);

	// The url -- must have the "isbn=" at the very end, or otherwise
	// be amenable to being appended to.
	final String url = "http://www.amazon.com/exec/obidos/ASIN/" + isbn;

	// Open the URL and get a Reader from it.
	final BufferedReader is = new BufferedReader(new InputStreamReader(
		new URL(url).openStream()));
	// Read the whole page into a single long string, so that the
	// RE can match across multiple lines.
	final String input = readerToString(is);

	// If found, return the rank as an int.
	Matcher m = r.matcher(input);
	if (m.find()) {
		// Group 1 is digits (and maybe ','s) that matched; remove commas
		return Integer.parseInt(m.group(1).replace(",",""));
	} else {
		throw new RuntimeException(
			"Pattern not matched in `" + url + "'!");
	}
}

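The listing calls readerToString() but this recipe does not show it. What follows is only a guess at what such a helper might look like, assuming it simply drains a Reader into one String and closes it; the class name ReaderUtilSketch is hypothetical and the author's real implementation may differ.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper class; the original readerToString() is not shown
// in this recipe, so this is an assumed implementation.
class ReaderUtilSketch {

    /** Reads everything from the Reader into one String, then closes it. */
    static String readerToString(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            // read() returns -1 at end of stream
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } finally {
            in.close();
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the network stream here.
        String page = readerToString(new StringReader("Sales Rank:\n# 26,252\n"));
        System.out.println(page);
    }
}
```

Reading in blocks rather than line by line keeps the original line terminators in the result, which matters when the regular expression is expected to match across them.
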
See Also:

As mentioned, using the regex API is vital to being able to deal with semi-structured data that you will meet in real life. Chapter Four of the Java Cookbook is all about regex, as is Jeffrey Friedl's comprehensive Mastering Regular Expressions.


The source code for this project can be downloaded from http://javacook.darwinsys.com/javasrc/regex/BookRank.java.

Comment from icyerasor (2010-07-13 03:55:16.393): Regex is okay to extract single parts of information, I guess. I don't know if a simple Java HTML (tag soup) parser wouldn't be more efficient. But always remember: don't use regex to parse HTML! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454