-
Notifications
You must be signed in to change notification settings - Fork 1k
Modified hdyc regex to accept spaces and percent-encoded spaces #6468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Modified hdyc regex to accept spaces and percent-encoded spaces #6468
Conversation
| :github => %r{\Ahttps?://(?:www\.)?github\.com/([a-zA-Z0-9_-]+)}, | ||
| :gitlab => %r{\Ahttps?://(?:www\.)?gitlab\.com/([a-zA-Z0-9_-]+)}, | ||
| :hdyc => %r{\Ahttps?://(?:www\.)?hdyc\.neis-one\.org/\?([a-zA-Z0-9_-]+)}, | ||
| :hdyc => %r{\Ahttps?://(?:www\.)?hdyc\.neis-one\.org/\?([a-zA-Z0-9\s%20_-]+)}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to add 20 there as that's a character class so it's matching each character individually and digits are already allowed.
I'm not really keen to handling escaping like this anyway - it would probably be better to canonicalise the URLs before matching them so we don't have to handle both escaped and unescaped versions in the regular expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also do we want \s or just allowed? Are other forms of whitespace like tab really valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I observed for #6419 the view uses the text truncate class, can that be the reason for the issue ?
If it is, there will be no reason to modify the regex here and just fix the view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question on what exact characters are allowed. HDYC references our display names, so anything valid on our website will be valid for HDYC. This seems to be the relevant declaration in our user validations:
openstreetmap-website/app/models/user.rb
Lines 106 to 109 in 8738786
| validates :display_name, :if => proc { |u| u.display_name_changed? }, | |
| :characters => { :url_safe => true }, | |
| :whitespace => { :leading => false, :trailing => false }, | |
| :width => { :minimum => 3 } |
Which in turn uses a custom validator that excludes these characters:
openstreetmap-website/app/validators/characters_validator.rb
Lines 4 to 5 in 8738786
| INVALID_CHARS = "\x00-\x08\x0b-\x0c\x0e-\x1f\x7f\ufffe\uffff" | |
| INVALID_URL_CHARS = "/;.,?%#" |
So perhaps those constants should be used in the regexp? Similar to what the validator does when it goes /[#{INVALID_CHARS}]/o.match?(value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, generally on regular expressions: "character classes" (lists of characters surrounded by square brackets [...]) take each character individually, so %20 is %, 2, or 0 individually, but not the three together. The \s does work because \ is a escape character which is a special case. % is a escape character in URLs, but not in regular expressions.
This is a great website to test out regular expressions: https://rubular.com/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, the changes made are valid right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well they don't actually break anything but they also don't really match % escape sequence and it definitely has redundant duplicate characters in the character class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that the tests are failing, as %20 is not translated to as they expect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, the changes made are valid right?
It would allow stuff like %22, %_20, and other invalid
Modified hdyc regex to accept spaces and percent-encoded spaces
For Issue #6419