OK, following on from the comments above, I think it's highly likely that the input string is in UTF-8 (after all, in an HTML context, what else would it be?).
On that basis, I humbly submit this:
    #include <string>
    #include <codecvt>
    #include <locale>

    // Convert a wide string to UTF-8.
    std::string narrow (const std::wstring& ws)
    {
        std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
        return convert.to_bytes (ws);
    }

    // Convert a UTF-8 string to a wide string (throws std::range_error on invalid input).
    std::wstring widen (const std::string& s)
    {
        std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
        return convert.from_bytes (s);
    }

    // Return s unchanged if it consists solely of the listed characters;
    // return " " if it is empty or contains anything else.
    std::string detect_Unicode (const std::string& s)
    {
        std::wstring ws = widen (s);
        if (ws.empty() || ws.find_first_not_of (L" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039") != std::wstring::npos)
            return " ";
        return s;
    }
    #include <iostream>

    int main ()
    {
        // Note: u8"..." literals are char-based up to C++17; C++20 makes them char8_t.
        std::cout << narrow (L"\u00A0 \u00C2 \u00E2 \u20AC \u2039\n");
        std::cout << "0. \"" << detect_Unicode (u8"abcde") << "\"\n";
        std::cout << "1. \"" << detect_Unicode (u8" â€‹ â€‹ ") << "\"\n";
        std::cout << "2. \"" << detect_Unicode (u8"are Â Â there is something Â Â Â â€‹ combination â€‹") << "\"\n";
        std::cout << "3. \"" << detect_Unicode (u8" Â Â ") << "\"\n";
        std::cout << "4. \"" << detect_Unicode (u8"â€‹ Â Â â€‹") << "\"\n";
        std::cout << "5. \"" << detect_Unicode (u8"Â Â â â") << "\"\n";
    }
Output:
      Â â € ‹
    0. " "
    1. " â€‹ â€‹ "
    2. " "
    3. " Â Â "
    4. "â€‹ Â Â â€‹"
    5. "Â Â â â"
Now this is not the output the OP expects, but I think that's simply because the logic (as opposed to the implementation) of detect_Unicode() looks flawed. The point here is that converting the input string to a wide string means that you can use standard basic_string operations on it reliably, because there are no multibyte issues now.
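To make the multibyte point concrete, here is a small sketch of my own (the two function names are made up for illustration). It runs the same find_first_of search once on raw UTF-8 bytes and once on wide characters. The UTF-8 encodings of é (bytes C3 A9) and â (bytes C3 A2) share their lead byte, so the byte-wise search reports a match that isn't really there:

```cpp
#include <string>

// Byte-wise search on UTF-8 data.
// Input is U+00E9 (bytes C3 A9), the search set is U+00E2 (bytes C3 A2).
bool narrow_search_matches ()
{
    const std::string input = "\xC3\xA9";
    const std::string set   = "\xC3\xA2";
    // find_first_of compares individual bytes, so the shared lead byte C3 matches.
    return input.find_first_of (set) != std::string::npos;
}

// The same search on wide strings compares whole characters.
bool wide_search_matches ()
{
    const std::wstring input = L"\u00E9";
    const std::wstring set   = L"\u00E2";
    return input.find_first_of (set) != std::wstring::npos;
}
```

The narrow version returns true (a false positive) and the wide version returns false, which is why detect_Unicode() widens its input before searching.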
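An alternative, slightly radical, implementation of detect_Unicode() might be:

    std::string detect_Unicode (const std::string& s)
    {
        std::wstring ws = widen (s);
        // Reject the string if any character lies above U+00FF.
        for (auto wide_char : ws)
        {
            if (wide_char > 0xff)
                return " ";
        }
        return s;
    }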
But really, now you have a wide string to hand in detect_Unicode, anything is possible, so go wild OP.
Other notes:
- std::wstring_convert and std::codecvt_utf8 (i.e. the contents of <codecvt>) are deprecated in C++17, but since there is no other obvious choice in the standard library you might as well run with them. You can always change the implementations of narrow and widen if it comes to it.
- Depending on platform, std::wstring might not be the best choice (wchar_t is only 16 bits wide on Windows, for example), but it's probably fine. You could also look at std::u16string and std::u32string.
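If the deprecation bothers you, one way to swap out widen is to decode the UTF-8 by hand into a std::u32string. The following is only a minimal sketch under my own assumptions: the name widen_utf32 is hypothetical, and while it rejects truncated or malformed sequences, it does not check for overlong encodings or surrogate code points, so treat it as a starting point rather than a validated decoder.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical replacement for widen(): decode UTF-8 into UTF-32 by hand.
// Sketch only: no checks for overlong encodings or surrogate code points.
std::u32string widen_utf32 (const std::string& s)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size ())
    {
        const unsigned char lead = static_cast<unsigned char> (s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;           // number of continuation bytes expected

        if      (lead < 0x80)           { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument ("invalid UTF-8 lead byte");

        // The sequence must fit inside the remaining input.
        if (extra >= s.size () - i)
            throw std::invalid_argument ("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char> (s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument ("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the low six bits
        }

        out.push_back (cp);
        i += extra + 1;
    }
    return out;
}
```

With this in place, detect_Unicode() could search a std::u32string against a U"..." character set instead, and every element is a full code point regardless of platform.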
Live demo.
Inspiration taken from here.