There are times when it's useful to extract the plain text from an HTML document. One example:
You're working with a database that supports full-text indexing, and you don't want the index cluttered with useless entries such as HTML tags.
One of the things I really like about software development is that there are usually many ways to accomplish a given task. Often, the challenge is choosing the most satisfactory solution from a pool of dozens of candidates.
Notice that I didn't say best solution. There is seldom a clearly best way to accomplish something, and good arguments can usually be made for two or three candidates.
Take the task at hand: extracting the plain text from an HTML document.
If you're ambitious, you might be tempted to write your own parser. Problem is, even if writing parsers is a familiar part of your skill set, you have to have a nagging hunch that you'd be reinventing the wheel.
You could go shopping for a code library or component to handle the task. That's certainly a valid approach, and one I tried. The problem with this approach is that you're likely to find several candidates, and that means time spent not only evaluating each one, but also learning to use the one you select.
After a couple of unsatisfactory hours pursuing the above approach, I thought, "Hey, Delphi has a TWebBrowser component, and, using it, I should be able to get at the plain text. Maybe."
So I went to my favorite web site for getting answers to development issues, http://tamaracka.com. It is hosted by the fine people who make the Rubicon text indexing add-on for databases and, naturally, is powered by Rubicon. They periodically archive all the posts from the Borland, Microsoft, and third-party library vendors' newsgroups.
I typed "TWebBrowser.Document" into their Borland search field and, within moments, found exactly what I was looking for: the source for a function that uses TWebBrowser (and some neat tricks) to return the plain text from a string of HTML.
I copied-n-pasted it into a simple "Proof of Concept" Delphi application to check it out, and, by golly, it worked beautifully.
It was originally posted to the borland.public.delphi.thirdpartytools.general newsgroup by somebody named Craig. Thanks a million, Craig! Your name and fine work live on in the Delphi demo project attached to this article.
Here's Craig's function:
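function HtmlToText(const _html: string): string;
var
  WebBrowser: TWebBrowser;
  Document: IHtmlDocument2;
  Body: IHTMLBodyElement;
  TextRange: IHTMLTxtRange;
  Doc, v: OleVariant;
begin
  Result := '';
  WebBrowser := TWebBrowser.Create(nil);
  try
    Doc := 'about:blank';
    WebBrowser.Navigate2(Doc);  // load a blank page so there is a document to write into
    Document := WebBrowser.Document as IHtmlDocument2;
    if (Assigned(Document)) then
    begin
      v := VarArrayCreate([0, 0], varVariant);  // one-element variant array
      v[0] := _html;
      Document.write(PSafeArray(TVarData(v).VArray));  // write expects a SAFEARRAY of variants
      Document.close;  // finish writing so the body gets parsed
      Body := Document.body as IHTMLBodyElement;
      TextRange := Body.createTextRange;
      Result := TextRange.text;  // the rendered text, without markup
    end;
  finally
    WebBrowser.Free;  // always release the browser control
  end;
end;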
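Note: You'll have to "use" these units for the function to work: ShDocVw, MSHTML, and ActiveX. VarArrayCreate also needs the Variants unit in Delphi 6 and later, if your unit doesn't already list it.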
The attached demo project is written in Delphi 7, but should work with any version of Delphi that includes TWebBrowser.
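For a quick check of your own, calling the function from a button handler is probably the simplest route. The following is only a sketch and is not taken from the demo project: it assumes a VCL form with two TMemo controls and a TButton, and the names MemoHtml, MemoText, and ButtonConvert are made up for illustration.

```pascal
// Hypothetical OnClick handler. TForm1, MemoHtml, MemoText and ButtonConvert
// are illustrative names; HtmlToText is the function shown above and is
// assumed to be declared in (or visible to) this unit.
procedure TForm1.ButtonConvertClick(Sender: TObject);
begin
  // MemoHtml holds the raw HTML; the extracted plain text goes to MemoText.
  MemoText.Text := HtmlToText(MemoHtml.Text);
end;
```

Paste some markup into the first memo, click the button, and the second memo should show the text with the tags stripped out.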