Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


C#: Different string encoding on attribute vs. constant

I'm writing a test for a function intended to remove invalid code points such as orphaned surrogates. However, I'm seeing a difference in the way the surrogate is encoded depending on how I write the test.

While this version of the test passes:

        [TestCategory("UnitTest")]
        [TestMethod]
        public void RemoveOrphanedSurrogatePair()
        {
            var input = "\uDDDD1975";
            var cleanText = input.ReplaceInvalidCodePoints();

            Assert.AreEqual(input.Length - 1, cleanText.Length);
            Assert.AreEqual("1975", cleanText);
        }

This one does not:

        [TestCategory("UnitTest")]
        [TestMethod]
        [DataRow("\uDDDD1975")]
        public void RemoveOrphanedSurrogatePair(string input)
        {
            var cleanText = input.ReplaceInvalidCodePoints();

            Assert.AreEqual(input.Length - 1, cleanText.Length);
            Assert.AreEqual("1975", cleanText);
        }

Looking at the debugger, the first variation encodes the string as "\uDDDD1975", but the second one produces "??1975", which appears as two valid characters instead of one orphaned surrogate.
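
The implementation of ReplaceInvalidCodePoints is not shown in the question. For readers who want to reproduce the tests, here is a minimal sketch under the assumption that the method simply drops any surrogate code unit that is not part of a well-formed high/low pair. The class name StringCleaningExtensions and the exact behavior are assumptions, not the asker's actual code.

```csharp
using System;
using System.Text;

public static class StringCleaningExtensions
{
    // Hypothetical implementation (assumption): copies the input while
    // skipping any surrogate code unit that has no matching partner.
    public static string ReplaceInvalidCodePoints(this string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        var builder = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char current = text[i];
            if (char.IsHighSurrogate(current)
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // Well-formed pair: keep both code units and skip the second.
                builder.Append(current).Append(text[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(current))
            {
                // Ordinary BMP character: keep it.
                builder.Append(current);
            }
            // Otherwise this is an orphaned surrogate and it is dropped.
        }
        return builder.ToString();
    }
}
```

With a method like this, an input that already contains two U+FFFD characters instead of the orphaned surrogate would pass through unchanged, because U+FFFD is a valid code point. That would be consistent with the length assertion failing in the second test.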



1 Answer


I think a clue to the answer can be found in (what else but) a @jonskeet blog post. Apparently C# uses UTF-16 to encode strings everywhere except in attribute constructor arguments, where UTF-8 is used. The compiler seems to encounter the orphaned surrogate, which has no valid UTF-8 encoding, and ends up treating it as two invalid Unicode characters. Those are then replaced by a pair of U+FFFD characters (the Unicode replacement character, which is used to indicate broken data when decoding binary to text). The following example shows the effect:

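using System;
using System.ComponentModel;
using System.Linq;

[Description(Value)]
class Test
{
    // A lone low surrogate: valid as a UTF-16 code unit, but not a valid code point.
    const string Value = "\uDDDD";

    static void Main()
    {
        // Read the same constant back through the attribute metadata.
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }

    // Prints each UTF-16 code unit of the string as four hex digits.
    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}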

This produces:

Attribute: fffd fffd
Constant: dddd
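
One possible way around this in a data-driven test is to keep the orphaned surrogate out of the attribute's string argument and pass it as an integer instead, since integer attribute arguments are stored in binary form and do not go through string encoding. The sketch below is only an illustration under that assumption. The helper name WithLeadingCodeUnit and the class SurrogateTestData are hypothetical, and the MSTest usage in the comment has not been verified against the asker's project.

```csharp
using System;

public static class SurrogateTestData
{
    // Hypothetical helper: prepends a single UTF-16 code unit, given as an
    // int, to the rest of the text. The surrogate never appears inside a
    // string literal that is stored in attribute metadata.
    public static string WithLeadingCodeUnit(int codeUnit, string rest)
    {
        if (codeUnit < 0 || codeUnit > 0xFFFF)
            throw new ArgumentOutOfRangeException(nameof(codeUnit));
        return (char)codeUnit + rest;
    }

    public static void Main()
    {
        // In MSTest this might be written as [DataRow(0xDDDD, "1975")] on a
        // test method taking (int codeUnit, string rest) (assumption).
        string input = WithLeadingCodeUnit(0xDDDD, "1975");

        Console.WriteLine(input.Length);                   // prints 5
        Console.WriteLine(((int)input[0]).ToString("x4")); // prints dddd
        Console.WriteLine(char.IsLowSurrogate(input[0]));  // prints True
    }
}
```

The test body would then build the input from the two arguments before calling the method under test, so the string is assembled at run time as UTF-16 and the orphaned surrogate stays intact.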
