{"id":301,"date":"2019-02-18T16:19:48","date_gmt":"2019-02-18T15:19:48","guid":{"rendered":"http:\/\/www.miscdebris.net\/blog\/?p=301"},"modified":"2019-02-18T16:37:24","modified_gmt":"2019-02-18T15:37:24","slug":"c-code-to-strip-html-tags-from-strings-using-wxwidgets","status":"publish","type":"post","link":"http:\/\/www.miscdebris.net\/blog\/2019\/02\/18\/c-code-to-strip-html-tags-from-strings-using-wxwidgets\/","title":{"rendered":"C++ code to strip html tags from strings using wxWidgets"},"content":{"rendered":"\n<p>I&#8217;m writing a small app (NoteSearch) to search through OneNote pages better than the standard OneNote search function does. Since OneNote returns HTML strings and I&#8217;m only interested in the text itself, I needed a simple function to remove the HTML tags. wxWidgets provides nearly everything, but I couldn&#8217;t find a function that does this job. So I searched the internet and found code on a <a href=\"https:\/\/www.linuxtopia.org\/online_books\/programming_books\/c++_practical_programming\/c++_practical_programming_065.html\">webpage<\/a> (actually it&#8217;s from the book <a href=\"http:\/\/web.mit.edu\/merolish\/ticpp\/TicV2.html\">Thinking in C++ &#8211; Volume 2: Practical Programming<\/a> by Bruce Eckel and Chuck Allison), which I &#8220;rewrote&#8221; for the <a href=\"https:\/\/www.wxwidgets.org\">wxWidgets<\/a> library.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"cpp\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">wxString&amp; stripHTMLTags(wxString&amp; s, bool reset)\n{\n  \/\/ Persists between calls so a tag can span several lines;\n  \/\/ pass reset = true when starting a new document.\n  static bool inTag = false;\n  bool done = false;\n\n  if (reset)\n    inTag = false;\n\n  while (!done) {\n    if (inTag) {\n      \/\/ The previous line started an HTML tag\n      \/\/ but didn't finish. 
Must search for '>'.\n      size_t rightPos = s.find('>');\n      if (rightPos != wxString::npos) {\n        inTag = false;\n        s.erase(0, rightPos + 1);\n      }\n      else {\n        done = true;\n        s.erase();\n      }\n    }\n    else {\n      \/\/ Look for start of tag:\n      size_t leftPos = s.find('&lt;');\n      if (leftPos != wxString::npos) {\n        \/\/ See if tag close is in this line (search from the tag start\n        \/\/ so a stray '>' before it cannot underflow the erase length):\n        size_t rightPos = s.find('>', leftPos);\n        if (rightPos == wxString::npos) {\n          inTag = done = true;\n          s.erase(leftPos);\n        }\n        else\n          s.erase(leftPos, rightPos - leftPos + 1);\n      }\n      else\n        done = true;\n    }\n  }\n\n  \/\/ Replace some special HTML characters\n  s.Replace(\"&amp;lt;\", \"&lt;\", true);\n  s.Replace(\"&amp;gt;\", \"&gt;\", true);\n  s.Replace(\"&amp;nbsp;\", \" \", true);\n  s.Replace(\"&amp;quot;\", \"'\", true);\n  \/\/ Decode &amp;amp; last so an escaped entity is not decoded twice\n  s.Replace(\"&amp;amp;\", \"&amp;\", true);\n\n  return s;\n}<\/pre>\n\n\n\n<p>You might need to replace more special characters depending on your needs. On a side note, I didn&#8217;t manage to get <a href=\"https:\/\/de.wordpress.org\/plugins\/syntaxhighlighter\/\">SyntaxHighlighter Evolved<\/a> (SHE) to work well with the Gutenberg editor of WordPress 5.x :(. <del>I just can&#8217;t change the block type to SHE.<\/del> You need to press the + above or below a block to add a SHE block, but the source code is not nicely formatted. So I use the <a href=\"https:\/\/enlighterjs.org\">EnlighterJS<\/a> source code formatter and I&#8217;m happy.<\/p>\n\n\n\n<p>The last part about the replacement of special HTML characters is obviously too special for the code highlighters. 
It should actually look like this:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"http:\/\/www.miscdebris.net\/blog\/wp-content\/uploads\/2019\/02\/image.tif\" alt=\"\" class=\"wp-image-331\"\/><figcaption>Somehow this is like a dog biting its own tail&#8230;<\/figcaption><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m writing a small app (NoteSearch) to search through OneNote pages better than the standard OneNote search function does. Since OneNote returns HTML strings and I&#8217;m only interested in the text itself, I needed a simple function to remove the HTML tags. wxWidgets provides nearly everything, but I couldn&#8217;t find a function that does this job. &hellip; <a href=\"http:\/\/www.miscdebris.net\/blog\/2019\/02\/18\/c-code-to-strip-html-tags-from-strings-using-wxwidgets\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">C++ code to strip html tags from strings using wxWidgets<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false},"version":2}},"categories":[30],"tags":[],"class_list":["post-301","post","type-post","status-publish","format-standard","hentry","category-development"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p6pnj-4R","_links":{"self":[{"href":"http
:\/\/www.miscdebris.net\/blog\/wp-json\/wp\/v2\/posts\/301"}],"collection":[{"href":"http:\/\/www.miscdebris.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.miscdebris.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.miscdebris.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.miscdebris.net\/blog\/wp-json\/wp\/v2\/comments?post=301"}],"version-history":[{"count":22,"href":"http:\/\/www.miscdebris.net\/blog\/wp-json\/wp\/v2\/posts\/301\/revisions"}],"predecessor-version":[{"id":332,"href":"http:\/\/www.miscdebris.net\/blog\/wp-json\/wp\/v2\/posts\/301\/revisions\/332"}],"wp:attachment":[{"href":"http:\/\/www.miscdebris.net\/blog\/wp-json\/wp\/v2\/media?parent=301"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.miscdebris.net\/blog\/wp-json\/wp\/v2\/categories?post=301"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.miscdebris.net\/blog\/wp-json\/wp\/v2\/tags?post=301"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}