Commit 64b3402

Merge pull request MicrosoftDocs#2348 from stwish-msft/patch-16
Add C Runtime UTF-8 support documentation
2 parents aaf36c5 + 5138605 commit 64b3402

File tree

3 files changed

+41
-31
lines changed

docs/c-runtime-library/locale-names-languages-and-country-region-strings.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,12 @@ _wsetlocale(LC_ALL, L"de-DE");
3737
_wsetlocale(LC_ALL, L"LC_MONETARY=en-GB;LC_TIME=es-ES");
3838
```
3939
40+
41+
## UTF-8 Support
42+
43+
UTF-8 support can be enabled by using the UTF-8 code page in your locale string. See the [UTF-8 Support section of `setlocale`](../c-runtime-library/reference/setlocale-wsetlocale.md#utf-8-support) for more information.
44+
45+
4046
## See also
4147
4248
[C Run-Time Library Reference](../c-runtime-library/c-run-time-library-reference.md)<br/>

docs/c-runtime-library/reference/setlocale-wsetlocale.md

Lines changed: 35 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -52,44 +52,44 @@ sets all categories, returning only the string
5252
en-US
5353
```
5454

55-
You can copy the string returned by **setlocale** to restore that part of the program's locale information. Global or thread local storage is used for the string returned by **setlocale**. Later calls to **setlocale** overwrite the string, which invalidates string pointers returned by earlier calls.
55+
You can copy the string returned by `setlocale` to restore that part of the program's locale information. Global or thread local storage is used for the string returned by `setlocale`. Later calls to `setlocale` overwrite the string, which invalidates string pointers returned by earlier calls.
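
Because later calls overwrite the returned string, a caller that wants to restore a locale has to copy it first. The following is a minimal sketch of that pattern, not part of the commit; the helper names `copy_current_locale` and `restore_locale` are hypothetical and exist only for illustration.

```C
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of the current locale string for
   `category`, or NULL on failure. The caller owns the returned buffer. */
char *copy_current_locale(int category)
{
    /* A NULL locale argument queries the setting without changing it. */
    const char *current = setlocale(category, NULL);
    if (current == NULL) {
        return NULL;
    }
    size_t len = strlen(current) + 1;
    char *copy = (char *)malloc(len);
    if (copy != NULL) {
        /* Copy now: the next setlocale call may overwrite `current`. */
        memcpy(copy, current, len);
    }
    return copy;
}

/* Hypothetical helper: restores a locale saved by copy_current_locale and
   frees the copy. Returns 1 on success, 0 if setlocale rejected the string. */
int restore_locale(int category, char *saved)
{
    assert(saved != NULL);
    int ok = setlocale(category, saved) != NULL;
    free(saved);
    return ok;
}
```

A caller would typically call `copy_current_locale(LC_ALL)` before switching locales temporarily, then pass the saved copy to `restore_locale(LC_ALL, saved)` when done.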
5656

5757
## Remarks
5858

59-
Use the **setlocale** function to set, change, or query some or all of the current program locale information specified by *locale* and *category*. *locale* refers to the locality (country/region and language) for which you can customize certain aspects of your program. Some locale-dependent categories include the formatting of dates and the display format for monetary values. If you set *locale* to the default string for a language that has multiple forms supported on your computer, you should check the **setlocale** return value to see which language is in effect. For example, if you set *locale* to "chinese" the return value could be either "chinese-simplified" or "chinese-traditional".
59+
Use the `setlocale` function to set, change, or query some or all of the current program locale information specified by *locale* and *category*. *locale* refers to the locality (country/region and language) for which you can customize certain aspects of your program. Some locale-dependent categories include the formatting of dates and the display format for monetary values. If you set *locale* to the default string for a language that has multiple forms supported on your computer, you should check the `setlocale` return value to see which language is in effect. For example, if you set *locale* to "chinese" the return value could be either "chinese-simplified" or "chinese-traditional".
6060

61-
**_wsetlocale** is a wide-character version of **setlocale**; the *locale* argument and return value of **_wsetlocale** are wide-character strings. **_wsetlocale** and **setlocale** behave identically otherwise.
61+
`_wsetlocale` is a wide-character version of `setlocale`; the *locale* argument and return value of `_wsetlocale` are wide-character strings. `_wsetlocale` and `setlocale` behave identically otherwise.
6262

6363
By default, this function's global state is scoped to the application. To change this, see [Global state in the CRT](../global-state.md).
6464

6565
### Generic-Text Routine Mappings
6666

6767
|TCHAR.H routine|_UNICODE & _MBCS not defined|_MBCS defined|_UNICODE defined|
6868
|---------------------|------------------------------------|--------------------|-----------------------|
69-
|**_tsetlocale**|**setlocale**|**setlocale**|**_wsetlocale**|
69+
|`_tsetlocale`|`setlocale`|`setlocale`|`_wsetlocale`|
7070

7171
The *category* argument specifies the parts of a program's locale information that are affected. The macros used for *category* and the parts of the program they affect are as follows:
7272

7373
|*category* flag|Affects|
7474
|-|-|
75-
| **LC_ALL** | All categories, as listed below. |
76-
| **LC_COLLATE** | The **strcoll**, **_stricoll**, **wcscoll**, **_wcsicoll**, **strxfrm**, **_strncoll**, **_strnicoll**, **_wcsncoll**, **_wcsnicoll**, and **wcsxfrm** functions. |
77-
| **LC_CTYPE** | The character-handling functions (except **isdigit**, **isxdigit**, **mbstowcs**, and **mbtowc**, which are unaffected). |
78-
| **LC_MONETARY** | Monetary-formatting information returned by the **localeconv** function. |
79-
| **LC_NUMERIC** | Decimal-point character for the formatted output routines (such as **printf**), for the data-conversion routines, and for the non-monetary formatting information returned by **localeconv**. In addition to the decimal-point character, **LC_NUMERIC** sets the thousands separator and the grouping control string returned by [localeconv](localeconv.md). |
80-
| **LC_TIME** | The **strftime** and **wcsftime** functions. |
75+
| `LC_ALL` | All categories, as listed below. |
76+
| `LC_COLLATE` | The `strcoll`, `_stricoll`, `wcscoll`, `_wcsicoll`, `strxfrm`, `_strncoll`, `_strnicoll`, `_wcsncoll`, `_wcsnicoll`, and `wcsxfrm` functions. |
77+
| `LC_CTYPE` | The character-handling functions (except `isdigit`, `isxdigit`, `mbstowcs`, and `mbtowc`, which are unaffected). |
78+
| `LC_MONETARY` | Monetary-formatting information returned by the `localeconv` function. |
79+
| `LC_NUMERIC` | Decimal-point character for the formatted output routines (such as `printf`), for the data-conversion routines, and for the non-monetary formatting information returned by `localeconv`. In addition to the decimal-point character, `LC_NUMERIC` sets the thousands separator and the grouping control string returned by [localeconv](localeconv.md). |
80+
| `LC_TIME` | The `strftime` and `wcsftime` functions. |
8181

82-
This function validates the category parameter. If the category parameter isn't one of the values given in the previous table, the invalid parameter handler is invoked, as described in [Parameter Validation](../../c-runtime-library/parameter-validation.md). If execution is allowed to continue, the function sets **errno** to **EINVAL** and returns **NULL**.
82+
This function validates the category parameter. If the category parameter isn't one of the values given in the previous table, the invalid parameter handler is invoked, as described in [Parameter Validation](../../c-runtime-library/parameter-validation.md). If execution is allowed to continue, the function sets `errno` to `EINVAL` and returns `NULL`.
8383

84-
The *locale* argument is a pointer to a string that specifies the locale. For information about the format of the *locale* argument, see [Locale Names, Languages, and Country/Region Strings](../../c-runtime-library/locale-names-languages-and-country-region-strings.md). If *locale* points to an empty string, the locale is the implementation-defined native environment. A value of **C** specifies the minimal ANSI conforming environment for C translation. The **C** locale assumes that all **`char`** data types are 1 byte and that their value is always less than 256.
84+
The *locale* argument is a pointer to a string that specifies the locale. For information about the format of the *locale* argument, see [Locale Names, Languages, and Country/Region Strings](../../c-runtime-library/locale-names-languages-and-country-region-strings.md). If *locale* points to an empty string, the locale is the implementation-defined native environment. A value of `C` specifies the minimal ANSI conforming environment for C translation. The `C` locale assumes that all ``char`` data types are 1 byte and that their value is always less than 256.
8585

8686
At program startup, the equivalent of the following statement is executed:
8787

8888
`setlocale( LC_ALL, "C" );`
8989

90-
The *locale* argument can take a locale name, a language string, a language string and country/region code, a code page, or a language string, country/region code, and code page. The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API. The set of locale names supported by **setlocale** are described in [Locale Names, Languages, and Country/Region Strings](../../c-runtime-library/locale-names-languages-and-country-region-strings.md). The set of language and country/region strings supported by **setlocale** are listed in [Language Strings](../../c-runtime-library/language-strings.md) and [Country/Region Strings](../../c-runtime-library/country-region-strings.md). We recommend the locale name form for performance and for maintainability of locale strings embedded in code or serialized to storage. The locale name strings are less likely to be changed by an operating system update than the language and country/region name form.
90+
The *locale* argument can take a locale name, a language string, a language string and country/region code, a code page, or a language string, country/region code, and code page. The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API. The set of locale names supported by `setlocale` are described in [Locale Names, Languages, and Country/Region Strings](../../c-runtime-library/locale-names-languages-and-country-region-strings.md). The set of language and country/region strings supported by `setlocale` are listed in [Language Strings](../../c-runtime-library/language-strings.md) and [Country/Region Strings](../../c-runtime-library/country-region-strings.md). We recommend the locale name form for performance and for maintainability of locale strings embedded in code or serialized to storage. The locale name strings are less likely to be changed by an operating system update than the language and country/region name form.
9191

92-
A null pointer that's passed as the *locale* argument tells **setlocale** to query instead of to set the international environment. If the *locale* argument is a null pointer, the program's current locale setting isn't changed. Instead, **setlocale** returns a pointer to the string that's associated with the *category* of the thread's current locale. If the *category* argument is **LC_ALL**, the function returns a string that indicates the current setting of each category, separated by semicolons. For example, the sequence of calls
92+
A null pointer that's passed as the *locale* argument tells `setlocale` to query instead of to set the international environment. If the *locale* argument is a null pointer, the program's current locale setting isn't changed. Instead, `setlocale` returns a pointer to the string that's associated with the *category* of the thread's current locale. If the *category* argument is `LC_ALL`, the function returns a string that indicates the current setting of each category, separated by semicolons. For example, the sequence of calls
9393

9494
```C
9595
// Set all categories and return "en-US"
@@ -105,9 +105,9 @@ returns
105105
LC_COLLATE=en-US;LC_CTYPE=en-US;LC_MONETARY=fr-FR;LC_NUMERIC=en-US;LC_TIME=en-US
106106
```
107107

108-
which is the string that's associated with the **LC_ALL** category.
108+
which is the string that's associated with the `LC_ALL` category.
109109

110-
The following examples pertain to the **LC_ALL** category. Either of the strings ".OCP" and ".ACP" can be used instead of a code page number to specify use of the user-default OEM code page and user-default ANSI code page for that locale name, respectively.
110+
The following examples pertain to the `LC_ALL` category. Either of the strings ".OCP" and ".ACP" can be used instead of a code page number to specify use of the user-default OEM code page and user-default ANSI code page for that locale name, respectively.
111111

112112
- `setlocale( LC_ALL, "" );`
113113

@@ -145,7 +145,7 @@ The following examples pertain to the **LC_ALL** category. Either of the strings
145145

146146
- `setlocale( LC_ALL, "<language>" );`
147147

148-
Sets the locale to the language that's indicated by *\<language>*, and uses the default country/region for the specified language and the user-default ANSI code page for that country/region as obtained from the host operating system. For example, the following calls to **setlocale** are functionally equivalent:
148+
Sets the locale to the language that's indicated by *\<language>*, and uses the default country/region for the specified language and the user-default ANSI code page for that country/region as obtained from the host operating system. For example, the following calls to `setlocale` are functionally equivalent:
149149

150150
`setlocale( LC_ALL, "en-US" );`
151151

@@ -159,22 +159,36 @@ The following examples pertain to the **LC_ALL** category. Either of the strings
159159

160160
Sets the code page to the value indicated by *<code_page>*, together with the default country/region and language (as defined by the host operating system) for the specified code page.
161161

162-
The category must be either **LC_ALL** or **LC_CTYPE** to effect a change of code page. For example, if the default country/region and language of the host operating system are "United States" and "English," the following two calls to **setlocale** are functionally equivalent:
162+
The category must be either `LC_ALL` or `LC_CTYPE` to effect a change of code page. For example, if the default country/region and language of the host operating system are "United States" and "English," the following two calls to `setlocale` are functionally equivalent:
163163

164164
`setlocale( LC_ALL, ".1252" );`
165165

166166
`setlocale( LC_ALL, "English_United States.1252");`
167167

168168
For more information, see the [setlocale](../../preprocessor/setlocale.md) pragma directive in the [C/C++ Preprocessor Reference](../../preprocessor/c-cpp-preprocessor-reference.md).
169169

170-
The function [_configthreadlocale](configthreadlocale.md) is used to control whether **setlocale** affects the locale of all threads in a program or only the locale of the calling thread.
170+
The function [_configthreadlocale](configthreadlocale.md) is used to control whether `setlocale` affects the locale of all threads in a program or only the locale of the calling thread.
171+
172+
## UTF-8 Support
173+
174+
Starting in Windows 10 build 17134 (April 2018 Update), the Universal C Runtime supports using a UTF-8 code page. This means that C runtime functions interpret the `char` strings passed to them as UTF-8 encoded. To enable UTF-8 mode, specify UTF-8 as the code page in the locale string passed to `setlocale`. For example, `setlocale(LC_ALL, ".utf8")` will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
175+
176+
After calling `setlocale(LC_ALL, ".UTF8")`, you may pass "😊" to `mbstowcs` and it will be properly translated to a `wchar_t` string, whereas previously no locale setting was available to do this.
177+
178+
UTF-8 mode is also enabled for functions that have historically translated `char` strings using the default Windows ANSI code page (ACP). For example, calling [`_mkdir("😊")`](../reference/mkdir-wmkdir.md) while using a UTF-8 code page will correctly produce a directory with that emoji as the folder name, instead of requiring the ACP to be changed to UTF-8 prior to running your program. Likewise, calling [`_getcwd()`](../reference/getcwd-wgetcwd.md) inside of that folder will return a UTF-8 encoded string. For compatibility, the ACP is still used if the C locale code page is not set to UTF-8.
179+
180+
The following aspects of the C Runtime are not able to use UTF-8 because they are set during program startup and must use the default Windows ANSI code page (ACP): [`__argv`](../argc-argv-wargv.md), [`_acmdln`](../acmdln-tcmdln-wcmdln.md), and [`_pgmptr`](../pgmptr-wpgmptr.md).
181+
182+
Before this support was added, [`mbrtoc16`, `mbrtoc32`](../reference/mbrtoc16-mbrtoc323.md), [`c16rtomb`, and `c32rtomb`](../reference/c16rtomb-c32rtomb1.md) already existed to translate between UTF-8 narrow strings, UTF-16 (the same encoding as `wchar_t` on Windows platforms), and UTF-32. For compatibility reasons, these APIs still only translate to and from UTF-8 and not the code page set via `setlocale`.
183+
184+
To use this feature on an OS prior to Windows 10, such as Windows 7, you must use [app-local deployment](../../windows/universal-crt-deployment.md#local-deployment) or link statically using version 17134 of the Windows SDK or later. For Windows 10 operating systems prior to 17134, only static linking is supported.
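
The UTF-8 behavior described in this section can be sketched as follows. This is an illustrative example, not part of the commit: the helper names `enable_utf8_locale` and `wide_length` are hypothetical, and the fallback locale names after `".UTF8"` are assumptions intended for non-Windows C libraries so the sketch also builds and runs there.

```C
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: tries to switch LC_ALL to a UTF-8 locale.
   ".UTF8" is the code-page form described in this section; the other
   names are assumptions for non-Windows C libraries.
   Returns 1 on success, 0 if no candidate was accepted. */
int enable_utf8_locale(void)
{
    static const char *const candidates[] = { ".UTF8", "C.UTF-8", "en_US.UTF-8" };
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; ++i) {
        if (setlocale(LC_ALL, candidates[i]) != NULL) {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical helper: returns the number of wchar_t units needed to hold
   `text` (excluding the terminator), or (size_t)-1 if the bytes are not
   valid in the current locale's encoding. */
size_t wide_length(const char *text)
{
    assert(text != NULL);
    /* With a NULL destination, mbstowcs only computes the required length. */
    return mbstowcs(NULL, text, 0);
}
```

After `enable_utf8_locale()` succeeds, `wide_length("\xF0\x9F\x98\x8A")` (the UTF-8 bytes for "😊") should report a valid length rather than `(size_t)-1`. The exact count depends on the width of `wchar_t`: where it is 16 bits, as on Windows, the character needs a surrogate pair.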
171185

172186
## Requirements
173187

174188
|Routine|Required header|
175189
|-------------|---------------------|
176-
|**setlocale**|\<locale.h>|
177-
|**_wsetlocale**|\<locale.h> or \<wchar.h>|
190+
|`setlocale`|\<locale.h>|
191+
|`_wsetlocale`|\<locale.h> or \<wchar.h>|
178192

179193
For additional compatibility information, see [Compatibility](../../c-runtime-library/compatibility.md).
180194

docs/c-runtime-library/reference/setmbcp.md

Lines changed: 0 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -34,16 +34,6 @@ Returns 0 if the code page is set successfully. If an invalid code page value is
3434

3535
The **_setmbcp** function specifies a new multibyte code page. By default, the run-time system automatically sets the multibyte code page to the system-default ANSI code page. The multibyte code page setting affects all multibyte routines that are not locale dependent. However, it is possible to instruct **_setmbcp** to use the code page defined for the current locale (see the following list of manifest constants and associated behavior results). For a list of the multibyte routines that are dependent on the locale code page rather than the multibyte code page, see [Interpretation of Multibyte-Character Sequences](../../c-runtime-library/interpretation-of-multibyte-character-sequences.md).
3636

37-
The multibyte code page also affects multibyte-character processing by the following run-time library routines:
38-
39-
||||
40-
|-|-|-|
41-
|[_exec functions](../../c-runtime-library/exec-wexec-functions.md)|[_mktemp](mktemp-wmktemp.md)|[_stat](stat-functions.md)|
42-
|[_fullpath](fullpath-wfullpath.md)|[_spawn functions](../../c-runtime-library/spawn-wspawn-functions.md)|[_tempnam](tempnam-wtempnam-tmpnam-wtmpnam.md)|
43-
|[_makepath](makepath-wmakepath.md)|[_splitpath](splitpath-wsplitpath.md)|[tmpnam](tempnam-wtempnam-tmpnam-wtmpnam.md)|
44-
45-
In addition, all run-time library routines that receive multibyte-character *argv* or *envp* program arguments as parameters (such as the **_exec** and **_spawn** families) process these strings according to the multibyte code page. Therefore, these routines are also affected by a call to **_setmbcp** that changes the multibyte code page.
46-
4737
The *codepage* argument can be set to any of the following values:
4838

4939
- **_MB_CP_ANSI** Use ANSI code page obtained from operating system at program startup.
