FIX: Encoding Decoding #265

jahnvi480 · 2025-09-30T06:54:04Z

Work Item / Issue Reference

AB#39049

GitHub Issue: #250

Summary

This pull request improves the handling of character encoding and decoding in the mssql_python/cursor.py module. The main changes ensure that encoding and decoding settings are dynamically retrieved from the connection, allowing for more robust and flexible support for different character sets when executing queries and fetching results.

Encoding and decoding settings improvements:

Added _get_encoding_settings and _get_decoding_settings helper methods to retrieve encoding and decoding configurations from the connection, with sensible fallbacks if unavailable.
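To illustrate the fallback behaviour described above, here is a minimal sketch of such a helper. It is not the code from this PR: the function name, the `getencoding()` accessor, the dictionary shape, and the default values are all assumptions made for illustration.

```python
from typing import Any, Dict

# Assumed fallback for this sketch only; not confirmed as the library's default.
# -8 is the ODBC constant SQL_WCHAR, 1 is SQL_CHAR.
DEFAULT_ENCODING: Dict[str, Any] = {"encoding": "utf-16le", "ctype": -8}


def get_encoding_settings(connection: Any) -> Dict[str, Any]:
    """Return the connection's encoding settings, or a fallback if unavailable."""
    getter = getattr(connection, "getencoding", None)  # hypothetical accessor
    if callable(getter):
        try:
            settings = getter()
            if settings and "encoding" in settings:
                return dict(settings)
        except Exception:
            # A failing lookup should not break query execution in this sketch.
            pass
    return dict(DEFAULT_ENCODING)


class FakeConnection:
    """Stand-in connection used only to exercise the helper."""

    def getencoding(self) -> Dict[str, Any]:
        return {"encoding": "gbk", "ctype": 1}


gbk_settings = get_encoding_settings(FakeConnection())
fallback_settings = get_encoding_settings(object())
```

Returning a copy of the dictionary keeps callers from mutating shared state by accident.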

Query execution enhancements:

Updated the execute and executemany methods to use dynamic encoding and character type settings when calling the underlying DDBC bindings, ensuring queries are sent with the correct encoding.

Result fetching improvements:

Modified fetchone, fetchmany, and fetchall methods to use dynamic decoding settings for character and wide character data, improving reliability and compatibility when reading results from the database.

mssql_python/pybind/ddbc_bindings.cpp

+                        size_t copySize = std::min(wstr.size(), info.columnSize);
+                #if defined(_WIN32)
+                        // Windows: direct copy
+                        wmemcpy(&wcharArray[i * (info.columnSize + 1)], wstr.c_str(), copySize);


mssql_python/pybind/ddbc_bindings.cpp

                        }
+
+                        size_t copySize = std::min(str.size(), info.columnSize);
+                        memcpy(&charArray[i * (info.columnSize + 1)], str.c_str(), copySize);


Copilot

Pull Request Overview

This PR enhances character encoding and decoding support in the mssql-python library to address issues with non-UTF-8 character sets, particularly East Asian encodings like GBK. The main focus is on making encoding and decoding settings dynamically configurable and properly handling character conversion during query execution and result fetching.

Added dynamic encoding/decoding configuration retrieval from connection objects
Enhanced parameter binding to use connection-specific encoding settings for SQL_C_CHAR and SQL_C_WCHAR types
Updated result fetching to apply proper character decoding based on connection settings
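As a small illustration of why a hard-coded decoding breaks for GBK data, the sketch below uses only Python's built-in codecs. The helper `try_decode` is hypothetical and is not part of mssql-python.

```python
from typing import Optional


def try_decode(raw: bytes, encoding: str) -> Optional[str]:
    """Decode raw column bytes, returning None when the encoding does not fit."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


text = "中文"
gbk_bytes = text.encode("gbk")     # two bytes per character in GBK
utf8_bytes = text.encode("utf-8")  # three bytes per character in UTF-8

# Decoding with the encoding the data was written in recovers the text.
decoded_ok = try_decode(gbk_bytes, "gbk")

# Assuming UTF-8 for GBK bytes fails, because the byte sequence is not valid UTF-8.
decoded_wrong = try_decode(gbk_bytes, "utf-8")
```

The same bytes are only meaningful under the encoding they were produced with, which is why the decoding has to come from the connection's settings rather than a fixed default.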

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
tests/test_003_connection.py	Added comprehensive test cases for various encoding scenarios including GBK, UTF-8, East Asian characters, and diagnostic tests
mssql_python/pybind/ddbc_bindings.cpp	Enhanced parameter binding and result fetching with encoding-aware string conversion functions and extensive debug logging
mssql_python/cursor.py	Added helper methods to retrieve encoding/decoding settings from connection and updated execute/fetch methods to use dynamic settings


sumitmsft · 2025-10-16T10:47:59Z

mssql_python/pybind/ddbc_bindings.cpp

-                                LOG("Appended NVARCHAR string of length {} to result row", numCharsInData);
-                            }  else {
+                                // Use the common decoding function
+                                row.append(DecodeString(dataBuffer.data(), dataLen, wcharEncoding, true));


In the wide (WCHAR/NVARCHAR) fetch paths we compute numCharsInData = dataLen / sizeof(SQLWCHAR) and then sometimes pass either dataLen or a recomputed numCharsInData * sizeof(SQLWCHAR) into DecodeString. If the buffer includes a null terminator or unused capacity, or if dataLen is odd or cuts a surrogate pair in half, we can end up mis-decoding or silently corrupting data (especially for emoji / supplementary plane characters).
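The failure modes listed above can be reproduced with Python's own UTF-16LE codec. This is only an illustration of the byte-level problem, not the `DecodeString` implementation from the PR. The buffer layout and the helper name `strict_decode_fails` are assumptions.

```python
def strict_decode_fails(raw: bytes) -> bool:
    """Return True if strict UTF-16LE decoding rejects the bytes."""
    try:
        raw.decode("utf-16-le")
        return False
    except UnicodeDecodeError:
        return True


text = "a😀"                         # 😀 needs a surrogate pair in UTF-16
payload = text.encode("utf-16-le")   # 2 bytes for "a" + 4 bytes for the emoji

# A fetch buffer usually has room for a terminator and spare capacity.
buffer = payload + b"\x00\x00" * 3

# Decoding the whole capacity instead of the reported length keeps the padding.
over_read = buffer.decode("utf-16-le")

# Cutting after the high surrogate, or at an odd byte, is rejected outright.
half_pair_fails = strict_decode_fails(payload[:4])
odd_length_fails = strict_decode_fails(payload[:5])

# Using exactly the reported byte length recovers the original text.
exact = buffer[: len(payload)].decode("utf-16-le")
```

With a lenient error handler the two truncated cases would not raise. They would produce replacement characters instead, which is the silent corruption the comment warns about.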

sumitmsft · 2025-10-16T10:48:54Z

mssql_python/pybind/ddbc_bindings.cpp

                if (columnSize == SQL_NO_TOTAL || columnSize > 4000) {
                    LOG("Streaming LOB for column {} (NVARCHAR)", i);
-                    row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false));
+                    row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false, charEncoding, wcharEncoding));


Confirm FetchLobColumnData passes the actual ODBC-reported byte length (not buffer capacity) into DecodeString; wide length must remain in bytes.

sumitmsft · 2025-10-16T10:49:52Z

mssql_python/pybind/ddbc_bindings.cpp

+                            // Use unix-specific conversion to handle the wchar_t/SQLWCHAR size difference
+                            SQLWCHAR* wcharData = &buffers.wcharBuffers[col - 1][i * fetchBufferSize];
+                            // Use DecodeString directly with the raw data
+                            py::object decodedStr = DecodeString(wcharData, numCharsInData * sizeof(SQLWCHAR), wcharEncoding, true);


Here we reconstruct byte length as numCharsInData * sizeof(SQLWCHAR). If numCharsInData was derived from a buffer that includes a null terminator or unused capacity, or if a surrogate pair boundary was truncated, we could mis-decode. Prefer passing the original ODBC dataLen (in bytes) captured per row/column rather than recomputing. Also add a defensive check: if (bytesLen % 2 != 0) log & fail for wide data.

Please talk to @gargsaumya about these implementations.
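The defensive check suggested in this comment could look roughly like the sketch below. It is written in Python to show the logic only; the real change would live in the C++ bindings. The function name, parameters, and error messages are hypothetical.

```python
def decode_wide_checked(buffer: bytes, data_len: int) -> str:
    """Decode wide data using the driver-reported byte length, not capacity.

    `data_len` stands in for the byte length ODBC reports per row and column.
    """
    if data_len < 0:
        raise ValueError("negative length reported for wide data")
    if data_len % 2 != 0:
        # UTF-16 code units are two bytes, so an odd length is never valid.
        raise ValueError(f"odd byte length {data_len} for wide data")
    if data_len > len(buffer):
        raise ValueError("reported length exceeds buffer size")
    # Slice to the reported length so terminators and spare capacity are ignored.
    return buffer[:data_len].decode("utf-16-le")


payload = "a😀".encode("utf-16-le")
padded = payload + b"\x00\x00" * 4  # terminator plus unused capacity

decoded = decode_wide_checked(padded, len(payload))

try:
    decode_wide_checked(padded, 5)
    odd_rejected = False
except ValueError:
    odd_rejected = True
```

Failing loudly on an odd length trades a hard error for the guarantee that truncated data never reaches the caller as a plausible-looking string.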

sumitmsft · 2025-10-16T10:51:24Z

mssql_python/pybind/ddbc_bindings.cpp

+    }
+}
+
+py::bytes EncodeString(const std::string& text, const std::string& encoding, bool toWideChar) {


For Python str parameters we already have Unicode, so avoid the legacy round trip (decode, encode, decode, encode). Encode directly, once: UTF‑16LE for wide, the requested encoding for narrow. This removes multiple allocations and potentially lossy conversions.
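A single-pass encoding along the lines of this suggestion might look like the following sketch. It mirrors the shape of the `EncodeString` signature quoted above, but the Python function name and its behaviour are assumptions, not the PR's code.

```python
def encode_parameter(text: str, encoding: str, to_wide: bool) -> bytes:
    """Encode a Python str exactly once for parameter binding."""
    if to_wide:
        # Wide parameters are always UTF-16LE; `encoding` is ignored here.
        return text.encode("utf-16-le")
    # Strict mode raises instead of silently dropping unencodable characters.
    return text.encode(encoding)


wide = encode_parameter("a😀", "gbk", to_wide=True)
narrow = encode_parameter("中文", "gbk", to_wide=False)

try:
    encode_parameter("😀", "gbk", to_wide=False)
    lossy_rejected = False
except UnicodeEncodeError:
    lossy_rejected = True
```

Because the input is already a Unicode string, one `encode` call is enough, and a strict encoder reports characters that the narrow encoding cannot represent instead of losing them.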

sumitmsft

Left a bunch of comments.

mssql_python/pybind/ddbc_bindings.cpp

+                            sqlwchars.size() * sizeof(SQLWCHAR));
+                        wcharArray[i * (info.columnSize + 1) + sqlwchars.size()] = 0;


FIX: Encoding Decoding

ffc1332

Copilot AI review requested due to automatic review settings September 30, 2025 06:54

github-actions bot added the pr-size: large Substantial code update label Sep 30, 2025

github-advanced-security bot found potential problems Sep 30, 2025

Copilot AI reviewed Sep 30, 2025

jahnvi480 and others added 4 commits October 7, 2025 12:14

Merge branch 'main' into jahnvi/githubissue_250

523b55e

Merge branch 'main' of https://github.com/microsoft/mssql-python into…

32bc203

… jahnvi/githubissue_250

Resolving SQL_WCHAR issue

bc8d7a6

Resolving issues

ac620f8