-
Notifications
You must be signed in to change notification settings - Fork 25
FIX: Encoding Decoding #265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
size_t copySize = std::min(wstr.size(), info.columnSize); | ||
#if defined(_WIN32) | ||
// Windows: direct copy | ||
wmemcpy(&wcharArray[i * (info.columnSize + 1)], wstr.c_str(), copySize); |
Check warning
Code scanning / devskim
These functions are historically error-prone and have been associated with a significant number of vulnerabilities. Most of these functions have safer alternatives, such as replacing 'strcpy' with 'strlcpy' or 'strcpy_s'. Warning
} | ||
|
||
size_t copySize = std::min(str.size(), info.columnSize); | ||
memcpy(&charArray[i * (info.columnSize + 1)], str.c_str(), copySize); |
Check notice
Code scanning / devskim
There are a number of conditions in which memcpy can introduce a vulnerability (mismatched buffer sizes, null pointers, etc.). More secure alternitives perform additional validation of the source and destination buffer Note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances character encoding and decoding support in the mssql-python library to address issues with non-UTF-8 character sets, particularly East Asian encodings like GBK. The main focus is on making encoding and decoding settings dynamically configurable and properly handling character conversion during query execution and result fetching.
- Added dynamic encoding/decoding configuration retrieval from connection objects
- Enhanced parameter binding to use connection-specific encoding settings for SQL_C_CHAR and SQL_C_WCHAR types
- Updated result fetching to apply proper character decoding based on connection settings
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
File | Description |
---|---|
tests/test_003_connection.py | Added comprehensive test cases for various encoding scenarios including GBK, UTF-8, East Asian characters, and diagnostic tests |
mssql_python/pybind/ddbc_bindings.cpp | Enhanced parameter binding and result fetching with encoding-aware string conversion functions and extensive debug logging |
mssql_python/cursor.py | Added helper methods to retrieve encoding/decoding settings from connection and updated execute/fetch methods to use dynamic settings |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
LOG("Appended NVARCHAR string of length {} to result row", numCharsInData); | ||
} else { | ||
// Use the common decoding function | ||
row.append(DecodeString(dataBuffer.data(), dataLen, wcharEncoding, true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the wide (WCHAR/NVARCHAR) fetch paths we compute numCharsInData = dataLen / sizeof(SQLWCHAR) and then sometimes pass either dataLen or a recomputed numCharsInData * sizeof(SQLWCHAR) into DecodeString. If the buffer includes a null terminator or unused capacity, or if dataLen is odd or cuts a surrogate pair in half, we can end up mis-decoding or silently corrupting data (especially for emoji / supplementary plane characters).
if (columnSize == SQL_NO_TOTAL || columnSize > 4000) { | ||
LOG("Streaming LOB for column {} (NVARCHAR)", i); | ||
row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false)); | ||
row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false, charEncoding, wcharEncoding)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirm FetchLobColumnData passes the actual ODBC-reported byte length (not buffer capacity) into DecodeString; wide length must remain in bytes.
// Use unix-specific conversion to handle the wchar_t/SQLWCHAR size difference | ||
SQLWCHAR* wcharData = &buffers.wcharBuffers[col - 1][i * fetchBufferSize]; | ||
// Use DecodeString directly with the raw data | ||
py::object decodedStr = DecodeString(wcharData, numCharsInData * sizeof(SQLWCHAR), wcharEncoding, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we reconstruct byte length as numCharsInData * sizeof(SQLWCHAR). If numCharsInData was derived from a buffer that includes a null terminator or unused capacity, or if a surrogate pair boundary was truncated, we could mis-decode. Prefer passing the original ODBC dataLen (in bytes) captured per row/column rather than recomputing. Also add a defensive check: if (bytesLen % 2 != 0) log & fail for wide data.
Please talk to @gargsaumya for these implementations
} | ||
} | ||
|
||
py::bytes EncodeString(const std::string& text, const std::string& encoding, bool toWideChar) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Python str parameters we already have Unicode; avoid legacy round‑trip (decode >> encode >> decode>> encode). Directly encode once (UTF‑16LE for wide, requested encoding for narrow). This will remove multiple allocations and potential lossy conversions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a bunch of comments..
sqlwchars.size() * sizeof(SQLWCHAR)); | ||
wcharArray[i * (info.columnSize + 1) + sqlwchars.size()] = 0; |
Check notice
Code scanning / devskim
There are a number of conditions in which memcpy can introduce a vulnerability (mismatched buffer sizes, null pointers, etc.). More secure alternitives perform additional validation of the source and destination buffer Note
Work Item / Issue Reference
Summary
This pull request improves the handling of character encoding and decoding in the
mssql_python/cursor.py
module. The main changes ensure that encoding and decoding settings are dynamically retrieved from the connection, allowing for more robust and flexible support for different character sets when executing queries and fetching results.Encoding and decoding settings improvements:
_get_encoding_settings
and_get_decoding_settings
helper methods to retrieve encoding and decoding configurations from the connection, with sensible fallbacks if unavailable.Query execution enhancements:
Updated the
execute
andexecutemany
methods to use dynamic encoding and character type settings when calling the underlying DDBC bindings, ensuring queries are sent with the correct encoding.Result fetching improvements:
Modified
fetchone
,fetchmany
, andfetchall
methods to use dynamic decoding settings for character and wide character data, improving reliability and compatibility when reading results from the database.