1 minute read - JSoup Java For Testers

JSoup Tip How to get raw element text with newlines in Java - Parsing HTML and XML with JSoup

Apr 13, 2017

TL;DR: with JSoup, either switch off document pretty printing or use textNodes to pull the raw text from an element.

A quick tip for JSoup.

I wanted to pull out the raw text from an HTML element and retain the \n newline characters. But HTML doesn’t care about those, so JSoup normally normalises them away when you read the text back out.

I found two ways to access them.

  • switching off pretty printing
  • using the textNodes

Switching off Pretty Printing

After you parse a document in JSoup you can switch off prettyPrint in its output settings:

Document doc = Jsoup.parse(filename, "UTF-8", "");
doc.outputSettings().prettyPrint(false);

Then when you access the html or other text in an element you can find all the \n characters in the text.

String textA = element.html();
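
As a fuller sketch of this first approach, the snippet below parses an inline HTML string rather than a file and reads an element's html back with pretty printing off. The class name, the rawHtml helper and the sample markup are made up for illustration; it assumes the JSoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class, not taken from the original code on GitHub.
class PrettyPrintOffExample {

    // Returns the inner html of the first element matching cssQuery,
    // with pretty printing switched off so source newlines are not normalised.
    static String rawHtml(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().prettyPrint(false);
        Element element = doc.select(cssQuery).first();
        return element.html();
    }

    public static void main(String[] args) {
        // Sample markup (an assumption) with a newline inside the paragraph text.
        String html = "<html><body><p id=\"note\">first line\nsecond line</p></body></html>";
        System.out.println(rawHtml(html, "#note"));
    }
}
```

Note that prettyPrint is a setting on the whole document, so it affects every later call to html() or outerHtml() on that document, not just this one element.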

Use the textNodes

This approach works regardless of whether you have prettyPrint on or off:

String text = "";
for(TextNode node : element.textNodes()){
    // getWholeText() keeps the newlines and spaces from the original source
    text = text + node.getWholeText() + "\n\n";
}
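
A minimal, self-contained sketch of the textNodes approach, leaving pretty printing at its default. It uses TextNode.getWholeText(), which returns the node's text including the newlines and spaces from the source. The class name, the rawText helper and the sample markup are hypothetical, and unlike the loop above it adds no extra separator between nodes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

// Hypothetical example class, not taken from the original code on GitHub.
class TextNodesExample {

    // Concatenates the un-normalised text of each direct child text node.
    // Text inside nested child elements is not included by textNodes().
    static String rawText(Element element) {
        StringBuilder text = new StringBuilder();
        for (TextNode node : element.textNodes()) {
            text.append(node.getWholeText());
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Sample markup (an assumption) with a newline inside the paragraph text.
        String html = "<p id=\"note\">first line\nsecond line</p>";
        Document doc = Jsoup.parse(html); // prettyPrint is left switched on
        Element element = doc.select("#note").first();
        System.out.println(rawText(element));
    }
}
```

Because textNodes() only returns the element's direct text children, an element with nested markup would need a walk over its child elements as well.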

If you accidentally use both methods then you might get confused.

I think I prefer the second approach because it works regardless of the prettyPrint setting.

You can find code that illustrates this on github in the file

See also the accompanying YouTube video.
