C#: Different string encoding on attribute vs. constant

asked Feb 6, 2021 in Technique by 深蓝 (71.8m points)

I'm writing a test for a function that removes invalid code points such as orphaned surrogates. However, the surrogate is encoded differently depending on how I write the test.

While this version of the test passes:

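        [TestCategory("UnitTest")]
        [TestMethod]
        public void RemoveOrphanedSurrogatePair()
        {
            // A lone low surrogate (U+DDDD) followed by four ASCII digits.
            var input = "\uDDDD1975";
            var cleanText = input.ReplaceInvalidCodePoints();

            // Exactly one UTF-16 code unit (the orphaned surrogate) should be removed.
            Assert.AreEqual(input.Length - 1, cleanText.Length);
            Assert.AreEqual("1975", cleanText);
        }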

This one does not:

        [TestCategory("UnitTest")]
        [TestMethod]
        [DataRow("\uDDDD1975")] // same literal, but passed as an attribute argument
        public void RemoveOrphanedSurrogatePair(string input)
        {
            var cleanText = input.ReplaceInvalidCodePoints();

            Assert.AreEqual(input.Length - 1, cleanText.Length);
            Assert.AreEqual("1975", cleanText);
        }

In the debugger, the first variation holds the string as "\uDDDD1975", but the second one produces "??1975", which appears to be two valid characters instead of one orphaned surrogate.

1 Answer

深蓝 · Answer 1 · 2021-02-06T00:59:22+0000

I think a clue to the answer can be found in (what else but) a @jonskeet blog post. Apparently C# uses UTF-16 to encode strings everywhere, except in attribute constructor arguments, where UTF-8 is used. An orphaned surrogate has no valid UTF-8 representation, so the compiler seems to emit its UTF-8 value as invalid data, which then reads back as two invalid characters. Those are replaced by a pair of \uFFFD characters (U+FFFD is the Unicode replacement character, used to indicate broken data when decoding binary to text).

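using System;
using System.ComponentModel;
using System.Linq;

[Description(Value)]
class Test
{
    // A lone low surrogate: valid as a UTF-16 code unit, but not a valid code point.
    const string Value = "\uDDDD";

    static void Main()
    {
        // Read the same constant back through the attribute metadata.
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }

    // Prints each UTF-16 code unit of the string as four hex digits.
    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}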

This program produces:

Attribute: fffd fffd
Constant: dddd
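
One possible way around this, offered as a sketch rather than something taken from the blog post: keep the attribute argument pure ASCII by writing the escape sequence as literal text in a verbatim string, and convert it to the real UTF-16 string at run time. `Regex.Unescape` understands `\uXXXX` escapes. The `SurrogateInputDemo` class and the `Decode` helper are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SurrogateInputDemo
{
    // Hypothetical helper: turns ASCII-only escaped text such as @"\uDDDD1975"
    // into the actual UTF-16 string. Note that Regex.Unescape also interprets
    // other backslash escapes, so a literal backslash must be written as "\\".
    public static string Decode(string escaped) => Regex.Unescape(escaped);

    static void Main()
    {
        // The verbatim literal contains only a backslash, 'u', and ASCII
        // characters, so nothing is lost if it is stored as UTF-8.
        const string escaped = @"\uDDDD1975";
        string input = Decode(escaped);

        // Dump the UTF-16 code units of the decoded string.
        Console.WriteLine(string.Join(" ", input.Select(c => ((int)c).ToString("x4"))));
        // prints: dddd 0031 0039 0037 0035
    }
}
```

In the data-driven test, the row would then be `[DataRow(@"\uDDDD1975")]`, with the test decoding its argument before calling `ReplaceInvalidCodePoints()`.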
