Longest common substring does not handle unicode properly #129

It seems that the longestCommonSubstring algorithm does not handle Unicode characters properly. It appears to compare lengths in UTF-16 code units rather than in code points:

longestCommonSubstr('𐌵𐌵**ABC', '𐌵𐌵--ABC') === '𐌵𐌵'
// whereas the longest one should be ABC (in terms of number of code points)

// Number of code points:
[...'𐌵𐌵'].length === 2
[...'ABC'].length === 3

// Number of UTF-16 code units (what .length counts):
'𐌵𐌵'.length === 4
'ABC'.length === 3

It may be worth adding a note to the algorithm about this. The problem can occur whenever the strings contain characters outside the Basic Multilingual Plane (BMP), i.e. code points greater than 0xFFFF, because each such character is stored as a surrogate pair and counts as 2 in .length.
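One possible way to patch this, sketched below, is to split both strings into code points with Array.from before running the usual dynamic-programming table. The function name longestCommonSubstringByCodePoint is hypothetical and this is not the repository's implementation, only a minimal illustration of the idea.

```javascript
// Hypothetical helper (name and structure are assumptions, not the repo's code).
// Compares strings by Unicode code point instead of UTF-16 code unit.
function longestCommonSubstringByCodePoint(a, b) {
  // Array.from uses the string iterator, which yields whole code points,
  // so a surrogate pair such as '𐌵' becomes a single array element.
  const s1 = Array.from(a);
  const s2 = Array.from(b);

  // prev[j] holds the length of the longest common suffix of
  // s1[0..i-1) and s2[0..j) from the previous row of the DP table.
  let prev = new Array(s2.length + 1).fill(0);
  let best = 0;    // length (in code points) of the best match so far
  let bestEnd = 0; // exclusive end index of that match in s1

  for (let i = 1; i <= s1.length; i++) {
    const curr = new Array(s2.length + 1).fill(0);
    for (let j = 1; j <= s2.length; j++) {
      if (s1[i - 1] === s2[j - 1]) {
        curr[j] = prev[j - 1] + 1;
        if (curr[j] > best) {
          best = curr[j];
          bestEnd = i;
        }
      }
    }
    prev = curr;
  }

  return s1.slice(bestEnd - best, bestEnd).join('');
}

console.log(longestCommonSubstringByCodePoint('𐌵𐌵**ABC', '𐌵𐌵--ABC'));
```

Note that this only fixes the surrogate-pair case. Characters built from several code points (combining marks, some emoji sequences) would still be split, and handling those would need grapheme segmentation, for example with Intl.Segmenter where available.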

Feel free to close the issue whenever you want. The aim was just to flag the problem in case you want to patch it in some way.
