Using UTF-8 (Unicode) in Gentoo

From Elvanör's Technical Wiki

Using UTF-8 on your Gentoo system is all but mandatory, for many reasons. This short guide contains some links to the official Gentoo UTF-8 documentation and discusses some potential issues with UTF-8.

Most important stuff: Gentoo documentation, and setting your locale

There are several things you need to do in order to have a full UTF-8 system. Basically, however, there are 4 steps:

  • Build UTF-8 support into your kernel.
  • Generate and use a UTF-8 locale (with a UTF-8 enabled glibc): edit /etc/env.d/02locale and set LANG to your desired locale.
  • Add the "unicode" flag to your USE flags in /etc/make.conf.
  • Add Unicode support to the console (emerge a Unicode font and edit some configuration files).
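
As a quick sanity check once the locale step is done, the following sketch tests whether a locale name carries a UTF-8 codeset. The function name is_utf8_locale is a hypothetical helper made up for illustration, not a Gentoo tool, and it only looks at the text of the name; it does not verify that the locale was actually generated.

```sh
#!/bin/sh
# is_utf8_locale NAME: succeed if the codeset part of NAME is UTF-8.
# (Hypothetical helper for illustration, not part of Gentoo.)
is_utf8_locale() {
    case "$1" in
        *.*) ;;           # the name has a codeset part after the dot
        *) return 1 ;;    # e.g. "C" or "POSIX": no codeset given
    esac
    _enc=${1#*.}          # drop "language_TERRITORY."
    _enc=${_enc%%@*}      # drop an optional "@modifier"
    case "$_enc" in
        [uU][tT][fF]-8|[uU][tT][fF]8) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the locale of the current shell (treated as "C" if LANG is unset).
if is_utf8_locale "${LANG:-C}"; then
    echo "LANG=${LANG} is a UTF-8 locale"
else
    echo "LANG=${LANG:-unset} is not a UTF-8 locale"
fi
```

Both spellings of the codeset (UTF-8 and utf8) are accepted, since either may appear depending on where the name comes from.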

The two following links explain all of this in more detail.

Warnings

  • Setting the correct locale is EXTREMELY important (especially if your current locale is not present on the system because locale-gen does not list it). I had serious problems with nano not editing files correctly (probably a UTF-8 problem) because the locale was not defined correctly. So double check that your locale is correct!
  • locale -a can be very useful, as it gives the exact names of the locales available on the system. Strangely, these can differ from what is written in locale.gen. If you change a particular locale setting (in my case, LC_TIME), the string used must match *exactly* the one output by locale -a. Otherwise applications will emit warnings that the locale cannot be found (and they will indeed fall back to the default "C" locale).
  • KDE can interfere with the locale for a given user. I had an instance of KDE-generated scripts in ~/.config (namely plasma-locale-settings.sh and plasma-localerc) that were tampering with the system-wide locale. Removing the script solved the problem.
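
The exact-match rule for locale -a mentioned above can be checked mechanically. This is only a sketch: locale_in_list is a hypothetical helper name, and the locale "fr_FR.utf8" is a placeholder to replace with your own value.

```sh
#!/bin/sh
# locale_in_list NAME: read locale names on stdin (one per line, as printed
# by `locale -a`) and succeed only if NAME matches one of them exactly.
# (Hypothetical helper name, for illustration only.)
locale_in_list() {
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -- "$1"
}

# Example: check the value you intend to use for LC_TIME.
wanted="fr_FR.utf8"   # assumption: replace with your own locale name
if locale -a 2>/dev/null | locale_in_list "$wanted"; then
    echo "$wanted is available"
else
    echo "$wanted is NOT listed by locale -a"
fi
```

Because the comparison is on whole lines, a partial name such as fr_FR does not count as a match.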

UTF-8, ISO 9660 and Joliet extensions

One problem I currently have is that I have not managed to burn a CD/DVD with UTF-8 filenames. In K3b, checking the option "Generate Rock Ridge extensions" creates a DVD that works under Linux (i.e., the filenames appear correctly). However, under Windows and Mac OS X the same DVD does not work (filenames appear as garbage characters).

This is because Linux uses the Rock Ridge extensions, whereas Windows/OS X must use the poorly designed Joliet extensions. It seems that mkisofs (part of cdrtools) currently cannot deal with an input-charset of UTF-8, or at least the stable version in Gentoo cannot. I don't know whether support could be added, or whether it is impossible because of some Joliet limitation. Joliet seems to use UTF-16 encoding for filenames.

Anyway, the current situation is that I can have DVDs with UTF-8 filenames, but they work correctly only with Linux. Slightly annoying. Maybe with more recent versions of cdrtools or cdrkit (the Debian fork of cdrtools), this issue will go away.

Update, March 2007: with cdrkit and K3b 1.0, the problems go away. cdrkit produces perfectly readable CDs (at least for ISO-8859-1) under both Windows and Linux. Currently cdrkit is not the default in Portage (it should really be), so you must manually unmerge cdrtools and emerge cdrkit.
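
To see why the filenames have to be converted rather than copied byte for byte, the sketch below prints the big-endian 16-bit form of a UTF-8 string, which is the kind of encoding Joliet expects. The helper name utf16be_hex is hypothetical, and the code assumes iconv, od and tr are installed. The genisoimage line in the comment is a typical cdrkit invocation and is not run here; disc.iso and ./files are placeholders.

```sh
#!/bin/sh
# utf16be_hex: read UTF-8 text on stdin and print its UTF-16BE bytes as hex.
# (Hypothetical helper; relies on iconv, od and tr being available.)
utf16be_hex() {
    iconv -f UTF-8 -t UTF-16BE | od -An -tx1 | tr -d ' \n'
    echo
}

# "é" is two bytes in UTF-8 (octal 303 251) and one 16-bit unit in UTF-16BE.
printf '\303\251' | utf16be_hex   # prints 00e9

# Typical cdrkit invocation with both Joliet (-J) and Rock Ridge (-R):
#   genisoimage -J -R -input-charset utf-8 -o disc.iso ./files
```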

Unicode and LaTeX (tetex)

  • It seems hard to make Unicode work correctly with LaTeX. There are two ways to use Unicode in LaTeX: \usepackage[utf8]{inputenc} and \usepackage[utf8x]{inputenc}. For utf8x, you must emerge the package latex-unicode. I think it is preferable to use the utf8 option.
  • After that, here is some LaTeX code showing how to set up UTF-8 in a .tex file:
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[russian,french]{babel}

\begin{document}

\selectlanguage{russian}

Here is some Russian: русский.

\selectlanguage{french}

Voilà du français.

\end{document}
  • The problem is that if you select the Russian language, the PDF produced by pdflatex will contain bitmap text (i.e., the text will not be rendered from vector outlines). I did not have this problem under Mac OS X. I am currently searching for a solution that would create beautiful PDFs with several languages used.
  • Update: Gentoo, as of March 2007, still uses the tetex TeX distribution. This distribution is outdated. The future of TeX on UNIX is apparently TeX Live. I hope that Gentoo makes the transition soon; in the meantime, some problems will be hard to solve with tetex, which is unmaintained.
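
One way to confirm the bitmap problem described above is to list the fonts embedded in the PDF: pdfTeX embeds bitmap fonts as "Type 3". The sketch below assumes the pdffonts utility (shipped with poppler or xpdf) is installed. The helper name has_bitmap_fonts and the file name document.pdf are hypothetical.

```sh
#!/bin/sh
# has_bitmap_fonts: read `pdffonts` output on stdin and succeed if any font
# is listed as "Type 3" (the type pdfTeX uses for embedded bitmap fonts).
# (Hypothetical helper name; pdffonts itself comes from poppler or xpdf.)
has_bitmap_fonts() {
    grep -q 'Type 3'
}

# Usage, assuming a file named document.pdf exists:
#   pdffonts document.pdf | has_bitmap_fonts && echo "bitmap fonts found"
```

This only diagnoses the issue; it does not say which font package is missing.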

Keyboards and input methods

  • To obtain a "ç" (c with cedilla) on a French KDE keyboard layout, type "Alt Gr" + ",".

Obsolete information

  • Add the following script to your default runlevel. UPDATE: this no longer seems to be needed.
#!/sbin/runscript
conf=/etc/env.d/02locale

# Using devfs?
if [ -e /dev/.devfsd ] || [ -e /dev/.udev -a -d /dev/vc ]; then
  device=/dev/vc/
else
  device=/dev/tty
fi

depend() {
        need localmount
        after keymaps
        before consolefont
}

checkconfig() {

  if [ -r ${conf} ]; then
          . ${conf}
          encoding=
          [ -n "${LC_ALL}" ]      && encoding=${LC_ALL#*.}   && return 0
          [ -n "${LC_MESSAGES}" ] && encoding=${LC_MESSAGES#*.} && return 0
          [ -n "${LANG}" ]        && encoding=${LANG#*.}   && return 0
  fi
  eend 1 "Locale is not configured, Please fix ${conf}"
  return 1
}

start() {
        ebegin "Setting consoles to UTF-8"
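        # Stop here if the locale is not configured (checkconfig already reported it)
        checkconfig || return 1
        # The regex must be unquoted: bash 3.2+ treats a quoted pattern as a literal string
        if [[ "${encoding}" =~ [uU][tT][fF]-?8 ]]; then
                dumpkeys | loadkeys --unicode
                for ((i=1; i <= RC_TTY_NUMBER; i++)); do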
                        echo -ne "\033%G" > ${device}${i}
                done
                eend 0
        else
                eend 1 "UTF-8 is not required"
        fi
}