UTF-8 Cygwin

June 1, 2006 - November 18, 2008 (鈴)


UTF-8 Cygwin lets you use every character and every file (or path) name that Windows allows, while remaining binary-compatible with the current Cygwin. It does this by adopting UTF-8 for file names and for console (i.e. "command prompt") I/O.

In cygwin1.dll, the kernel of Cygwin, a lot of code operates on file names. To build UTF-8 Cygwin, I made all such code use UTF-8 consistently. Using the C/C++ preprocessor, I replaced every occurrence of an ANSI-Win32 API that operates on a file name with my own UTF-8 API, which typically receives a UTF-8 string, decodes it into Unicode, invokes the corresponding Unicode-Win32 API, and encodes the result back into UTF-8. Roughly speaking, I built a UTF-8 ⇔ Unicode conversion layer between cygwin1.dll and Windows.
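The decode, call, re-encode pattern described above can be sketched as follows. This is only an illustration in Python, not the actual patch (which is C/C++ inside cygwin1.dll): `utf8_wrapper` and `fake_get_full_path_w` are hypothetical names, the latter standing in for a Unicode-Win32 API, and a Python `str` stands in for the Unicode string that Windows would receive.

```python
from typing import Callable


def utf8_wrapper(wide_api: Callable[[str], str]) -> Callable[[bytes], bytes]:
    """Wrap a Unicode-string function so callers pass and receive UTF-8 bytes."""

    def wrapped(name_utf8: bytes) -> bytes:
        name = name_utf8.decode("utf-8")   # UTF-8 -> Unicode
        result = wide_api(name)            # stand-in for the Unicode-Win32 call
        return result.encode("utf-8")      # Unicode -> UTF-8

    return wrapped


# Hypothetical stand-in for a Unicode-Win32 API that returns a path.
def fake_get_full_path_w(name: str) -> str:
    return "C:\\cygwin\\home\\" + name


# The caller side only ever sees UTF-8 byte strings.
get_full_path_utf8 = utf8_wrapper(fake_get_full_path_w)
full_path = get_full_path_utf8("鈴.txt".encode("utf-8"))
```

In this sketch the wrapper is applied at run time; the approach described above achieves the same substitution at compile time with the preprocessor, so the calling code does not need to change.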

I also modified some code directly, such as the console I/O code, for which simple preprocessing is not adequate.

The format of the shortcut files that implement Unix-like symbolic links is intentionally left unchanged, and so is the mount table saved in the registry. Both are bidirectionally compatible with those of the current Cygwin.

Environment variables and file contents are also left unchanged. For environment variables, this choice is debatable. File contents, however, should stay unchanged, following Unix conventions: when you save a file, its name is converted and passed to the Unicode-Win32 API, but its contents are stored simply as a sequence of bytes.



For Cygwin 1.5.25-15 (Revised)

Below are the patch against the source files (cygwin-1.5.25-15-src.tar.bz2) and a compiled binary, along with their md5 sums.

See here for the changes in this revision (in Japanese).


Expand cygwin1-dll-20-11-18.tar.bz2 to get cygwin1.dll, terminate all Cygwin processes, and put cygwin1.dll in C:\cygwin\bin. If you have mounted a path whose name includes non-ASCII characters, log off or reboot once after putting cygwin1.dll in C:\cygwin\bin.

To get the complete source:

  1. Get cygwin-1.5.25-15-src.tar.bz2 via Cygwin's setup.exe, or download it directly from a mirror site via ftp/http.
  2. 01:~$ tar xf cygwin-1.5.25-15-src.tar.bz2
    01:~$ cd cygwin-1.5.25-15/winsup
    01:~/cygwin-1.5.25-15/winsup$ bzcat ~/winsup-utf8-patch-20-11-18.diff.bz2 | patch -p0
    patching file ./cygwin/cygwin.sc
    patching file ./cygwin/fhandler.h
    patching file ./cygwin/fhandler_console.cc
    patching file ./cygwin/miscfuncs.cc
    patching file ./cygwin/path.cc
    patching file ./cygwin/spawn.cc
    patching file ./cygwin/winsup.h

It is under the same licence as the original. Use it freely.

cygstart command

Commands that bypass cygwin1.dll and invoke the Windows API directly also need to be modified to use UTF-8. Here is a modified cygstart command, the most popular command of this kind. Below are the modified source file from cygutils-1.3.2-1-src.tar.bz2 and a compiled binary.

Expand cygstart-exe-12-15.tar.bz2 to get cygstart.exe, then put it in /bin/ (which is really C:\cygwin\bin).

It is under the same licence as the original. Use it freely.

Example configuration files

Put .bashrc, .inputrc, and .vimrc in your home directory. If you have installed Python, put sitecustomize.py in /usr/lib/python2.5/site-packages/.

Note: these are the files I actually use at present, so not all of their contents relate to UTF-8. Please edit them as you like. Unless you speak Japanese, you should probably delete the last line or two of the .vimrc.


Cygwin's default settings are not 8-bit transparent. You can use the example files above to fix this.

Usage is basically the same as with the current Cygwin, except that the character encoding for console I/O and file names is now UTF-8. The main advantage is that you can use every character allowed in Windows through Cygwin's POSIX API, without restrictions. The current Cygwin, in contrast, has so far allowed only the subset of characters that are expressible in the default code page and consistent with ASCII. Note, however, that the console may not display some Unicode characters well, depending on the fonts. Change the Windows settings or use another terminal program if necessary.

Charset designation by locale within the "libc" layer is not yet supported (just as in BeOS or Mac OS X 10.2 and earlier). As a result, you will have difficulty editing a bash command line, since readline does not recognize UTF-8 as such.


Default code page

If the conversion from UTF-8 to Unicode fails for a string that is to be passed to the Windows API, a conversion from the default code page to Unicode is attempted instead.

Thus you can read text in the default code page almost transparently, provided that the typical byte patterns of characters in the default code page differ widely from those of UTF-8. (One example: CP932 in a Japanese environment.)
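The fallback described in this section can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code in the patch: `decode_file_name` is a hypothetical name, and the default code page is hard-coded to CP932 here, whereas the real layer would ask Windows for it.

```python
def decode_file_name(raw: bytes, default_code_page: str = "cp932") -> str:
    """Decode a byte string as UTF-8, falling back to the default code page."""
    try:
        # Strict decoding: any byte sequence that is not valid UTF-8 raises.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: CP932 stands in for the system's default code page.
        return raw.decode(default_code_page)


from_utf8 = decode_file_name("日本語.txt".encode("utf-8"))
from_cp932 = decode_file_name("日本語.txt".encode("cp932"))
```

Both calls yield the same name, because the CP932 bytes for these characters are not valid UTF-8 and therefore trigger the fallback. A byte string that happens to be valid in both encodings is always read as UTF-8, which is why the transparency is only "almost" complete.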

Mac OS X

When a string is converted to Unicode to be passed to the Windows API, a combination of a HIRAGANA/KATAKANA letter and a COMBINING KATAKANA-HIRAGANA (SEMI-)VOICED SOUND MARK is replaced with the corresponding (semi-)voiced HIRAGANA/KATAKANA letter, because the two are equivalent in Unicode and Windows happens to use the latter. Similar processing is performed on letters in Latin-1, Latin-2, Latin with macron, and Esperanto.

In practice, this prevents file names from being garbled when you expand a tar file taken from Mac OS X, which happens to use the former (decomposed) combination.
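The effect of this replacement can be illustrated with Unicode's standard NFC normalization. This is a stand-in, not the patch's own method: the patch handles only the specific sets of letters listed above, while NFC composes every canonically equivalent sequence. `compose_file_name` is a hypothetical name.

```python
import unicodedata


def compose_file_name(name: str) -> str:
    """Replace base letter + combining mark pairs with precomposed letters."""
    # Assumption: full NFC is used here; the patch covers a narrower set.
    return unicodedata.normalize("NFC", name)


# HIRAGANA LETTER KA + COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
decomposed_kana = "\u304b\u3099"
# LATIN SMALL LETTER E + COMBINING ACUTE ACCENT (a Latin-1 letter once composed)
decomposed_latin = "e\u0301"

composed_kana = compose_file_name(decomposed_kana)    # HIRAGANA LETTER GA
composed_latin = compose_file_name(decomposed_latin)  # E WITH ACUTE
```

Each decomposed pair is two code points long and each composed result is one, so a name stored in decomposed form and the same name typed on Windows would otherwise compare as different strings.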

Copyright (c) 2006, 2007, 2008 OKI Software Co., Ltd.