Helixoft Blog

Peter Macej - lead developer of VSdocman - talks about Visual Studio tips and automation

Recently, a VSdocman user reported an interesting problem that appeared when he edited XML comments in VSdocman's WYSIWYG comment editor. He copied text with accented characters, for example the French word "période", from the editor itself or from IE. When he pasted it into the editor, "p�riode" appeared instead of "période". As you can see, the non-ASCII characters were corrupted.

We could reproduce the problem on some systems but not on others. We isolated it to the following line of code in the comment editor:

Clipboard.GetDataObject().GetData(DataFormats.Html)

DataFormats.Html represents the CF_HTML clipboard format, which is entirely text and uses UTF-8. So in a DataObject, the data is stored as a byte array of UTF-8 code units. Some characters are encoded as one byte and some as two or more bytes.
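To make the variable byte length concrete, here is a minimal sketch that counts the UTF-8 bytes of a few characters. The class and method names (Utf8ByteCountDemo, Utf8Length) are made up for illustration and are not part of VSdocman.

```csharp
using System;
using System.Text;

public static class Utf8ByteCountDemo
{
    // Number of bytes the given text occupies once encoded as UTF-8.
    public static int Utf8Length(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        Console.WriteLine(Utf8Length("e"));      // 1
        Console.WriteLine(Utf8Length("\u00E9")); // 2 (é)
        Console.WriteLine(Utf8Length("\u5ABD")); // 3 (媽)

        // The two bytes of é, printed as hex.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u00E9"))); // C3-A9
    }
}
```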

When the data is retrieved, DataObject.GetData(DataFormats.Html) returns a string.

The string is produced by decoding those bytes with the OS default encoding, which is usually an ANSI code page. In this case, each byte comes back as one character, so a UTF-8 character encoded as two bytes is retrieved as two characters. For example, the é character (whose 2-byte UTF-8 representation is C3 A9) is converted to the two characters Ã© (their ANSI codes are C3 and A9). This was what we expected. The string was malformed, but it could be converted back to the correct text with the following code:

Encoding.UTF8.GetString(Encoding.Default.GetBytes(cf_html))
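The round trip can be sketched without the clipboard. This is only an illustration under one loud assumption: ISO-8859-1 (code page 28591) stands in for the ANSI default code page, because it maps every byte to exactly one character and is available without extra setup. The names MojibakeDemo, Garble and Repair are hypothetical.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Assumption: ISO-8859-1 plays the role of Encoding.Default (ANSI).
    static readonly Encoding Ansi = Encoding.GetEncoding(28591);

    // Simulates the old GetData behavior: UTF-8 bytes read back one char per byte.
    public static string Garble(string text)
    {
        return Ansi.GetString(Encoding.UTF8.GetBytes(text));
    }

    // The conversion shown above: chars back to bytes, then decoded as UTF-8.
    public static string Repair(string garbled)
    {
        return Encoding.UTF8.GetString(Ansi.GetBytes(garbled));
    }

    public static void Main()
    {
        string garbled = Garble("p\u00E9riode");
        Console.WriteLine(garbled.Length);                   // 8: é became two characters
        Console.WriteLine(Repair(garbled) == "p\u00E9riode"); // True
    }
}
```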

What we found out was that, whenever the paste went wrong, the string returned from GetData was different. It had already been decoded as UTF-8, so a character encoded as two bytes came back as one character. For example, the é character (whose 2-byte UTF-8 representation is C3 A9) was correctly "converted" to a single é character. Applying the conversion above to this already-correct string is what corrupted it.

While this was nice, it was inconsistent behavior. I tested it in various environments with different OS, Visual Studio and IE versions and different regional settings. Finally I found the culprit: .NET Framework 4.5, which returns the string in a different format than previous versions of .NET.

  • In .NET 4.0 and earlier, DataObject.GetData(DataFormats.Html) returns a string decoded with the default encoding, which is usually ANSI.
  • In .NET 4.5 (and later?), DataObject.GetData(DataFormats.Html) returns a string decoded as UTF-8.
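This difference explains the "p�riode" symptom: running the pre-4.5 conversion on a string that is already correct produces bytes that are not valid UTF-8. The sketch below reproduces that, again assuming ISO-8859-1 as a stand-in for the ANSI default code page; DoubleRepairDemo and LegacyRepair are illustrative names only.

```csharp
using System;
using System.Text;

public static class DoubleRepairDemo
{
    // Assumption: ISO-8859-1 plays the role of Encoding.Default (ANSI).
    static readonly Encoding Ansi = Encoding.GetEncoding(28591);

    // The pre-4.5 fix: treat each char as one byte, then decode as UTF-8.
    public static string LegacyRepair(string cfHtml)
    {
        return Encoding.UTF8.GetString(Ansi.GetBytes(cfHtml));
    }

    public static void Main()
    {
        // What .NET 4.5 hands back: the text is already correct.
        string alreadyCorrect = "p\u00E9riode";

        // é becomes the single byte E9, which is an incomplete UTF-8 sequence,
        // so the decoder substitutes U+FFFD (the replacement character).
        string result = LegacyRepair(alreadyCorrect);
        Console.WriteLine(result == "p\uFFFDriode"); // True
    }
}
```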

If you want to test it, create the following HTML page:

<html>
  <body>
    <p>période</p>
  </body>
</html>

Open the page in Internet Explorer.

Then create and call the following C# method in Visual Studio:

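// Requires: using System; using System.Windows.Forms;
public void ShowClipboard()
{
	try {
		// Read the CF_HTML data from the clipboard as a string.
		string html = (string)Clipboard.GetDataObject().GetData(DataFormats.Html, true);
		// Prefix the runtime version so the two outputs can be compared.
		html = "NET=" + Environment.Version.ToString() + "\n" + html;
		MessageBox.Show(html);

	} catch (Exception ex) {
		MessageBox.Show(ex.ToString(), "Error");
	}
}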

In the project properties, set Target framework to version 2.0. In IE, select all the text and press Ctrl+C. Execute the method. You'll see the following result:

NET=2.0.50727.5420
Version:1.0
StartHTML:000000211
EndHTML:000000383
StartFragment:000000333
EndFragment:000000341
StartSelection:000000333
EndSelection:000000341
SourceURL:file://helixoft/Users/Public/Test-clipboard.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD></HEAD>

<BODY>

<P><!--StartFragment-->pÃ©riode<!--EndFragment--></P>
</BODY>
</HTML>

Do the same with the Target framework set to version 4.5. You'll get:

NET=4.0.30319.17929
Version:1.0
StartHTML:000000211
EndHTML:000000383
StartFragment:000000333
EndFragment:000000341
StartSelection:000000333
EndSelection:000000341
SourceURL:file://helixoft/Users/Public/Test-clipboard.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD></HEAD>

<BODY>

<P><!--StartFragment-->période<!--EndFragment--></P>
</BODY>
</HTML>


So if you target multiple .NET Framework versions and you handle the clipboard in its CF_HTML format (e.g. you filter out certain HTML tags), you need to remember this inconsistent behavior. You may use something like:

string cf_html = (string)Clipboard.GetDataObject().GetData(DataFormats.Html, true);
// .NET 4.5 started with version 4.0.30319.17929
if (Environment.Version.CompareTo(new System.Version("4.0.30319.17929")) < 0) {
	// .NET 4.0-: undo the ANSI decoding to recover the UTF-8 text
	cf_html = Encoding.UTF8.GetString(Encoding.Default.GetBytes(cf_html));
}
// .NET 4.5+: do nothing, it's OK

Or something more advanced if you need the original CF_HTML byte array:

/// <summary>
/// Gets original CF_HTML bytes from the string.
/// </summary>
/// <remarks>
/// <para>The DataFormats.Html (alias <see
/// href="http://msdn.microsoft.com/en-us/library/aa767917.aspx">CF_HTML</see>)
/// format uses UTF-8. So in a DataObject, the data is stored as a byte array of
/// UTF-8 codes. Some characters are encoded as one byte and some as two or more
/// bytes.</para>
/// <para>When the data is retrieved, the <c>DataObject.GetData(DataFormats.Html)</c> returns a string.</para>
/// <para>In .NET 4.0 and earlier, the string is encoded with default encoding which
/// is usually ANSI. In this case, each byte is returned as one character. So UTF-8
/// character encoded as two bytes is retrieved as two characters. For example, the
/// <b>é</b> character (whose 2-byte UTF-8 representation is C3 A9) is converted
/// to two characters <b>Ã©</b> (their ANSI codes are C3 and A9).</para>
/// <para>In .NET 4.5, the string is encoded with UTF-8 encoding. So UTF-8 character
/// encoded as two bytes is retrieved as one character. For example, the <b>é</b>
/// character (whose 2-byte UTF-8 representation is C3 A9) is correctly converted to
/// one <b>é</b> character.</para>
/// <para>To unify this different behavior in various .NET frameworks, this method
/// detects either case and returns the original byte array. This array then can be
/// converted to single-byte ASCII encoding which ensures correct offset values in
/// CF_HTML data header. Or it can be converted to UTF-8 encoding which returns the
/// correct text but offsets need not match.</para>
/// </remarks>
/// <param name="cf_html">A string retrieved from a DataObject as DataFormats.Html
/// (CF_HTML) format.</param>
/// <returns>
/// Original byte array for the DataFormats.Html (CF_HTML) format as it was stored
/// in a DataObject.
/// </returns>
public byte[] GetBytesFromCF_HtmlString(string cf_html)
{
	if (string.IsNullOrEmpty(cf_html)) {
		return new byte[0];
	}

	byte[] bytes = Encoding.Default.GetBytes(cf_html);
	string newStr = Encoding.Default.GetString(bytes);
	if (newStr.Equals(cf_html)) {
		// The original string cf_html may be encoded in the default encoding (ANSI), .NET 4.0 and earlier.
		// But it may also be encoded in UTF-8, .NET 4.5.
		// It's impossible to distinguish between default and UTF-8. Suppose, for example, the original string is
		// "Ã©". It may be:
		// 1. In .NET 4.0 and earlier, it is encoded in the default encoding (ANSI), representing two bytes C3 A9
		//    that represent "é" in CF_HTML format.
		// 2. In .NET 4.5, it is encoded in UTF-8, representing 4 bytes that represent "Ã©" in CF_HTML format.
		// So there is no exact way to determine which case applies. We need to resolve it only by .NET
		// version. This is not a clean solution, but...

		// .NET 4.5 started with version 4.0.30319.17929
		if (Environment.Version.CompareTo(new System.Version("4.0.30319.17929")) >= 0) {
			// .NET 4.5+
			bytes = Encoding.UTF8.GetBytes(cf_html);
		} else {
			// .NET 4.0-: bytes already hold the original CF_HTML data
		}
	} else {
		// Original string is not encoded in default encoding (ANSI).
		// Some of its characters have no correct representation in the default encoding, e.g. 媽.
		// So original string is encoded in UTF-8, .NET 4.5
		bytes = Encoding.UTF8.GetBytes(cf_html);
	}

	return bytes;
}
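The remark about offsets in the doc comment can be checked against the dumps above: EndFragment minus StartFragment is 8, which is the UTF-8 byte length of "période", not its 7 characters. The following sketch shows why the fragment should be sliced from the byte array rather than from the decoded string. It builds a simplified CF_HTML payload of its own (fewer header fields than IE emits), and every name in it (CfHtmlOffsetDemo, Build, ReadOffset, FragmentByBytes, FragmentByChars) is hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CfHtmlOffsetDemo
{
    const string HeaderFormat =
        "Version:1.0\r\nStartHTML:{0:D10}\r\nEndHTML:{1:D10}\r\n" +
        "StartFragment:{2:D10}\r\nEndFragment:{3:D10}\r\n";
    const string Prefix = "<html><body><!--StartFragment-->";
    const string Suffix = "<!--EndFragment--></body></html>";

    // Builds a simplified CF_HTML payload whose offsets count UTF-8 bytes.
    public static byte[] Build(string fragment)
    {
        // The header is ASCII with fixed-width numbers, so chars == bytes.
        int startHtml = string.Format(CultureInfo.InvariantCulture, HeaderFormat, 0, 0, 0, 0).Length;
        int startFragment = startHtml + Encoding.UTF8.GetByteCount(Prefix);
        int endFragment = startFragment + Encoding.UTF8.GetByteCount(fragment);
        int endHtml = endFragment + Encoding.UTF8.GetByteCount(Suffix);
        string header = string.Format(CultureInfo.InvariantCulture, HeaderFormat,
            startHtml, endHtml, startFragment, endFragment);
        return Encoding.UTF8.GetBytes(header + Prefix + fragment + Suffix);
    }

    // Reads a numeric header field such as "StartFragment:0000000137".
    public static int ReadOffset(string text, string key)
    {
        int pos = text.IndexOf(key + ":", StringComparison.Ordinal);
        if (pos < 0) throw new FormatException("Missing header field: " + key);
        pos += key.Length + 1;
        int end = pos;
        while (end < text.Length && text[end] >= '0' && text[end] <= '9') end++;
        return int.Parse(text.Substring(pos, end - pos), CultureInfo.InvariantCulture);
    }

    // Slices the byte array first, then decodes only the slice as UTF-8.
    public static string FragmentByBytes(byte[] cfHtml)
    {
        // ASCII decoding yields one char per byte, so header positions are preserved.
        string ascii = Encoding.ASCII.GetString(cfHtml);
        int start = ReadOffset(ascii, "StartFragment");
        int end = ReadOffset(ascii, "EndFragment");
        return Encoding.UTF8.GetString(cfHtml, start, end - start);
    }

    // Decodes everything first, then misuses the byte offsets as char indexes.
    public static string FragmentByChars(byte[] cfHtml)
    {
        string text = Encoding.UTF8.GetString(cfHtml);
        int start = ReadOffset(text, "StartFragment");
        int end = ReadOffset(text, "EndFragment");
        return text.Substring(start, end - start);
    }

    public static void Main()
    {
        byte[] payload = Build("p\u00E9riode");
        Console.WriteLine(FragmentByBytes(payload) == "p\u00E9riode");  // True
        Console.WriteLine(FragmentByChars(payload) == "p\u00E9riode<"); // True: one char too many
    }
}
```

With more accented characters in front of or inside the fragment, the character-based slice drifts further, which is the reason to keep the original byte array around when the offsets matter.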

And remember, the same version-dependent encoding applies to DataObject.SetData(DataFormats.Html) too.

 

 

 
