C++ code to strip HTML tags from strings using wxWidgets

I’m writing a small app (NoteSearch) that searches through OneNote pages better than the standard OneNote search function. Since OneNote returns HTML strings and I’m only interested in the text itself, I needed a simple function to remove the HTML tags. wxWidgets provides nearly everything, but I couldn’t find a function that does this job. So I crawled the internet and found code on some webpage (it’s actually from the book Thinking in C++, Volume 2: Practical Programming by Bruce Eckel and Chuck Allison), which I “rewrote” for the wxWidgets library.

wxString& stripHTMLTags(wxString& s, bool reset)
{
  static bool inTag = false;
  bool done = false;

  if (reset)
    inTag = false;

  while (!done) {
    if (inTag) {
      // The previous line started an HTML tag
      // but didn't finish. Must search for '>'.
      size_t rightPos = s.find('>');
      if (rightPos != wxString::npos) {
        inTag = false;
        s.erase(0, rightPos + 1);
      }
      else {
        done = true;
        s.erase();
      }
    }
    else {
      // Look for start of tag:
      size_t leftPos = s.find('<');
      if (leftPos != wxString::npos) {
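        // See if tag close is in this line. Search from the '<' onwards,
        // so a stray '>' earlier in the text is not mistaken for it:
        size_t rightPos = s.find('>', leftPos);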
        if (rightPos == wxString::npos) {
          inTag = done = true;
          s.erase(leftPos);
        }
        else
          s.erase(leftPos, rightPos - leftPos + 1);
      }
      else
        done = true;
    }
  }

  // Replace some special HTML entities.
  // "&amp;" goes last so that e.g. "&amp;lt;" ends up as "&lt;", not "<".
  s.Replace("&lt;", "<", true);
  s.Replace("&gt;", ">", true);
  s.Replace("&nbsp;", " ", true);
  s.Replace("&quot;", "'", true);
  s.Replace("&amp;", "&", true);

  return s;
}
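
To see how the "inside a tag" state carries over from one line to the next, here is a minimal standalone sketch. It is not the wxWidgets function itself: it assumes std::string in place of wxString so that it compiles without wxWidgets, and it passes the flag in by reference instead of using a static variable and a reset parameter. The names stripTagsStd and stripLines are made up for this illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical std::string port of the tag-stripping loop.
// The caller owns the inTag flag instead of a static variable.
std::string& stripTagsStd(std::string& s, bool& inTag)
{
  bool done = false;
  while (!done) {
    if (inTag) {
      // A tag opened on an earlier line is still open.
      std::string::size_type rightPos = s.find('>');
      if (rightPos != std::string::npos) {
        inTag = false;
        s.erase(0, rightPos + 1);
      } else {
        done = true;
        s.clear();  // the whole line is inside the tag
      }
    } else {
      std::string::size_type leftPos = s.find('<');
      if (leftPos != std::string::npos) {
        // Look for the closing '>' only after the '<'.
        std::string::size_type rightPos = s.find('>', leftPos);
        if (rightPos == std::string::npos) {
          inTag = done = true;
          s.erase(leftPos);
        } else {
          s.erase(leftPos, rightPos - leftPos + 1);
        }
      } else {
        done = true;
      }
    }
  }
  return s;
}

// Strip a sequence of lines that belong together and join the text.
std::string stripLines(const std::vector<std::string>& lines)
{
  bool inTag = false;  // fresh state for every new document
  std::string result;
  for (std::string line : lines) {
    result += stripTagsStd(line, inTag);
  }
  return result;
}
```

Feeding it the two lines `Hello <b` and `class="x">world</b>!` gives back `Hello world!`, because the tag that starts on the first line is only closed on the second one. With the wxWidgets function above, the equivalent is to pass reset = true for the first line of each new string and false for the following lines.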

You might need to replace some more special characters depending on your needs. On a side note, I didn’t manage to get SyntaxHighlighter Evolved (SHE) to work well with the Gutenberg editor of WordPress 5.x :(. I just can’t change the type of an existing block to SHE. You have to press the + above or below a block to add a SHE block, and even then the source code is not nicely formatted. So I use the EnlighterJS source code formatter instead, and I’m happy.
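
If the list of special characters grows, a small lookup table may be easier to maintain than a long row of Replace calls. The following is only a sketch under a few assumptions: it uses std::string rather than wxString, the helper names replaceAll and decodeEntities are invented here, and the table entries are examples (note that it maps &quot; to a double quote, unlike the function above).

```cpp
#include <string>
#include <utility>
#include <vector>

// Replace every occurrence of `from` in `s` with `to`.
void replaceAll(std::string& s, const std::string& from, const std::string& to)
{
  if (from.empty()) return;
  std::string::size_type pos = 0;
  while ((pos = s.find(from, pos)) != std::string::npos) {
    s.replace(pos, from.size(), to);
    pos += to.size();  // continue after the inserted text
  }
}

// Decode a handful of HTML entities using a table.
std::string decodeEntities(std::string s)
{
  static const std::vector<std::pair<std::string, std::string>> table = {
    {"&lt;", "<"},
    {"&gt;", ">"},
    {"&nbsp;", " "},
    {"&quot;", "\""},
    {"&apos;", "'"},
    {"&amp;", "&"}  // keep last, otherwise "&amp;lt;" would become "<"
  };
  for (const auto& entry : table) {
    replaceAll(s, entry.first, entry.second);
  }
  return s;
}
```

Adding another entity is then a one-line change to the table. With wxString, the loop body could call s.Replace(from, to, true) as in the function above instead of the replaceAll helper.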

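The last part, the replacement of special HTML characters, is obviously too special for the code highlighters: they tend to decode the entities instead of displaying them. The search strings in those Replace calls are meant to be the HTML entities &lt;, &gt;, &amp;, &nbsp; and &quot;.

Somehow this is like a dog biting its own tail…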
