I'm writing a test for a function that removes invalid code points, such as orphaned surrogates.
However, the surrogate ends up encoded differently depending on how I write the test.
This version of the test passes:
[TestCategory("UnitTest")]
[TestMethod]
public void RemoveOrphanedSurrogatePair()
{
    // "\uDDDD" is a lone low surrogate, followed by the digits 1975.
    var input = "\uDDDD1975";
    var cleanText = input.ReplaceInvalidCodePoints();
    Assert.AreEqual(input.Length - 1, cleanText.Length);
    Assert.AreEqual("1975", cleanText);
}
This one does not:
[TestCategory("UnitTest")]
[TestMethod]
[DataRow("\uDDDD1975")]
public void RemoveOrphanedSurrogatePair(string input)
{
    var cleanText = input.ReplaceInvalidCodePoints();
    Assert.AreEqual(input.Length - 1, cleanText.Length);
    Assert.AreEqual("1975", cleanText);
}
Looking at the debugger, the first variation encodes the string as "\uDDDD1975",
but the second one produces "??1975",
which appears to be two valid characters instead of one orphaned surrogate.
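To compare the two strings independently of how the debugger renders them, the UTF-16 code units can be dumped directly. The following is only a minimal sketch: `SurrogateDiagnostics` and `DescribeCodeUnits` are hypothetical names introduced for illustration and are not part of the code under test. It relies only on `char.IsSurrogate`, `string.Join`, and LINQ.

```csharp
using System;
using System.Linq;

public static class SurrogateDiagnostics
{
    // Hypothetical helper: renders each UTF-16 code unit as four hex digits
    // and wraps surrogate code units (U+D800..U+DFFF) in brackets.
    public static string DescribeCodeUnits(string s)
    {
        return string.Join(" ", s.Select(c =>
            char.IsSurrogate(c) ? $"[{(int)c:X4}]" : $"{(int)c:X4}"));
    }

    public static void Main()
    {
        // Same literal as in the passing test: one lone low surrogate plus "1975".
        var literal = "\uDDDD1975";

        Console.WriteLine(literal.Length);             // prints 5
        Console.WriteLine(DescribeCodeUnits(literal)); // prints [DDDD] 0031 0039 0037 0035
    }
}
```

Calling `DescribeCodeUnits(input)` at the top of the `DataRow` variation would show which code units actually reach the test method, and whether `input.Length` is still 5 there.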