default character encoding for everything in debian

 
  

Roger Leigh
Posted: Wed Aug 12, 2009 7:10 am    Post subject: Re: default character encoding for everything in debian
Archived from groups: linux>debian>devel

On Wed, Aug 12, 2009 at 09:56:49AM +0200, Samuel Thibault wrote:
> Giacomo A. Catenazzi, le Wed 12 Aug 2009 08:03:30 +0200, a écrit :
> > Bastian Blank wrote:
> > > On Tue, Aug 11, 2009 at 09:40:35PM +0200, Bernd Eckenfels wrote:
> > >> In article <20090811183800.GE5487@const.famille.thibault.fr> you wrote:
> > >>> Not necessarily. Any sane implementation should just use wchar_t
> > >> Which could be UTF-16 and therefore still has complicated length semantics.
> > >
> > > No, wchar_t is UCS-4 (or UCS-2 in esoteric implementations like
> > > Windows).
> >
> > No, wchar_t is locale dependent (per POSIX).
>
> What do you mean? The compiler can't know the locale in advance for
> the width and endianness. The value might depend on the locale, yes,
> but that's not a problem as long as you convert into UTF-8 before
> communicating with other applications.
>
> On sane systems (which Debian systems are), it's just always UCS-4.

Specifically, __STDC_ISO_10646__ is defined to indicate that wchar_t
holds ISO 10646 code points (UCS-4, given a 32-bit wchar_t) in all
locales.
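
For anyone who wants to check this on their own system, here is a minimal sketch. The function names are made up for illustration; only the __STDC_ISO_10646__ macro and WCHAR_MAX come from the C standard.

```c
#include <assert.h>
#include <stdbool.h>
#include <wchar.h>

/* True when the implementation promises that wchar_t values are
   ISO 10646 code points in every locale. */
bool wchar_is_iso10646(void) {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

/* True when a single wchar_t can hold any code point up to U+10FFFF,
   i.e. no surrogate pairs are needed. */
bool wchar_holds_all_code_points(void) {
    return WCHAR_MAX >= 0x10FFFF;
}

/* Hypothetical start-up check for a program that relies on both
   properties; it aborts if either assumption does not hold. */
void require_ucs4_wchar(void) {
    assert(wchar_is_iso10646());
    assert(wchar_holds_all_code_points());
}
```

On a platform with a 16-bit wchar_t the second check should fail, which is the case Bernd was worried about.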

> > BTW on gcc:
> >
> > -fwide-exec-charset=charset
> > Set the wide execution character set, used for wide string and
> > character constants.
>
> It hurts when I shoot myself in the foot.

This feature of GCC is one of the more obscure areas of locale
handling. How does the encoding of strings at the level of
individual translation units interact with a single per-process
global locale and C formatted I/O? Curious minds would like to
know!

> > The default is UTF-32 or UTF-16, whichever corresponds to the width of
> > wchar_t.
>
> This documentation is bogus BTW. It should read "UCS-4 or UCS-2".

It's "strictly" correct according to the standard.
http://en.wikipedia.org/wiki/UTF-32/UCS-4 for an overview.


Regards,
Roger

--
.''`. Roger Leigh
: :' : Debian GNU/Linux http://people.debian.org/~rleigh/
`. `' Printing on GNU/Linux? http://gutenprint.sourceforge.net/
`- GPG Public Key: 0x25BFB848 Please GPG sign your mail.


--
To UNSUBSCRIBE, email to debian-devel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Roger Leigh
Posted: Wed Aug 12, 2009 8:10 am    Post subject: Re: default character encoding for everything in debian

On Wed, Aug 12, 2009 at 07:54:33AM +0200, Giacomo A. Catenazzi wrote:
> Samuel Thibault wrote:
> > Gunnar Wolf, le Tue 11 Aug 2009 13:28:08 -0500, a écrit :
> >> while length(str) in any language up to the 1990s was a mere
> >> subtraction, now we must go through the string checking each byte to
> >> see if it is a Unicode marker and subtract the appropriate number of
> >> bytes.
> >
> > Not necessarily. Any sane implementation should just use wchar_t and
> > you get subtraction back.
>
> An implementation that uses wchar_t is usually not sane, but usually
> it is (also) buggy. It is very difficult (AFAIK not impossible,
> but I'm not so sure) to write portable (POSIX way, so with changing
> locales) programs using wchar_t.

Do you have any concrete examples to back up these assertions?

The wide character functions worked perfectly well for me last time I
checked. There were bugs in the distant past, but I don't see any
issues with current GCC/libc.

BTW, since POSIX/SUS are a superset of the standard C library, they
contain all of the same wide character handling functionality. I'm
not sure what you're getting at with the "changing locales"; SUS
locale functionality like setlocale() comes directly from C with no
changes.
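
To make the length discussion quoted above concrete, here is a rough sketch of the three cases: scanning UTF-8 bytes, asking the C library under the current locale, and subtracting pointers once the text is in wchar_t. The function names are hypothetical; mbsrtowcs() is standard C, and the UTF-8 scan assumes well-formed input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count code points in a well-formed UTF-8 string: every byte that is
   not a continuation byte (10xxxxxx) starts a new code point. */
size_t utf8_code_points(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Count the characters of a multibyte string as interpreted by the
   current LC_CTYPE locale. Returns (size_t)-1 if the bytes are not
   valid in that locale. */
size_t locale_char_count(const char *s) {
    assert(s != NULL);
    mbstate_t state;
    memset(&state, 0, sizeof state);
    return mbsrtowcs(NULL, &s, 0, &state);
}

/* Once the text is in a wchar_t array, the distance between two
   positions is plain pointer subtraction again. */
size_t wide_span(const wchar_t *begin, const wchar_t *end) {
    assert(begin != NULL && end >= begin);
    return (size_t)(end - begin);
}
```

A program would normally call setlocale(LC_ALL, "") from <locale.h> before locale_char_count(), so that the user's locale is used rather than the default "C" locale.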


Regards,
Roger



Thomas Koch
Posted: Wed Aug 12, 2009 8:10 am    Post subject: Re: default character encoding for everything in debian

It's impressive how quickly threads on this list grow big. :)

I'm not sure whether a conclusion has already been reached.

1. apt-get install mysql
2. enter mysql client
3. create database test; create table test( test char(10) );

Replace mysql with whatever application you like.

What should be the encoding of database and table test in cases like the
above?

Currently it's iso-something, which discriminates against everybody from
other countries. If it were utf-8 instead, there would be at least two
advantages:

- The clueless user would get a sane default
- utf-8 doesn't discriminate the way iso-8859-1 does

Best regards,

Thomas Koch

> Hi,
>
> I have an issue: I forgot to set the character encoding of tomcat to
> utf-8 after reinstalling a server.
> Now, before I report a wishlist(?) bug against tomcat, I want to ask (and
> invite discussion): shouldn't utf8 be the default character set everywhere?
> So when installing a package from Debian I can assume that wherever a
> character encoding can be set, it's set to utf8.
> MySQL would be another example, which to my knowledge uses isoXYZ as
> default character encoding.
>
> Best regards,
>
> Thomas Koch, http://www.koch.ro

Thomas Koch, http://www.koch.ro


Roger Leigh
Posted: Wed Aug 12, 2009 9:10 am    Post subject: Re: default character encoding for everything in debian

On Wed, Aug 12, 2009 at 01:18:12PM +0200, Thomas Koch wrote:
> I'm not sure whether a conclusion has already been reached.
>
> 1. apt-get install mysql
> 2. enter mysql client
> 3. create database test; create table test( test char(10) );
>
> Replace mysql with whatever application you like.
>
> What should be the encoding of database and table test in cases like the
> above?
>
> Currently it's iso-something, which discriminates against everybody from
> other countries. If it were utf-8 instead, there would be at least two
> advantages:
>
> - The clueless user would get a sane default
> - utf-8 doesn't discriminate the way iso-8859-1 does

UTF-8 is the sane default choice in this situation, so long as MySQL
is capable of handling it.


Regards,
Roger



Samuel Thibault
Posted: Wed Aug 12, 2009 11:10 am    Post subject: Re: default character encoding for everything in debian

Roger Leigh, le Wed 12 Aug 2009 11:30:50 +0100, a écrit :
> > > The default is UTF-32 or UTF-16, whichever corresponds to the width of
> > > wchar_t.
> >
> > This documentation is bogus BTW. It should read "UCS-4 or UCS-2".
>
> It's "strictly" correct according to the standard.
> http://en.wikipedia.org/wiki/UTF-32/UCS-4 for an overview.

« except that the UTF-32 standard has additional Unicode
semantics. »

In UTF-32 mode, gcc introduces a BOM, and in UTF-16 mode it accepts
characters above U+FFFF without any warning.

Samuel


Osamu Aoki
Posted: Fri Aug 14, 2009 4:10 pm    Post subject: Re: default character encoding for everything in debian

Hi,

(I want to see as much UTF-8 support as possible. These days, it is not
bad. Try using "sed" with UTF-8. It works! Of course, with some
understandable glitches.)

On Mon, Aug 10, 2009 at 08:55:27PM +0200, Norbert Preining wrote:
> On Mo, 10 Aug 2009, Roger Leigh wrote:
> > Of course there's a penalty for certain operations. But UTF-8 is about
> > as compact as an extended encoding is going to get.
>
> Rubbish. You know why in Japan and other Asian countries UTF8 is not
> so common? Because many of their glyphs need 4 (four!) bytes, while
> for example jis-2022 (AFAIR) is much more compact.

Hmmm... not the best example here, technically, if you are talking
about size. We have too many encodings for Japanese, and jis-2022
text is full of ESC codes.

> We are not living in an ASCII world anymore.

True.

Our choice of encoding has little to do with size. It is inertia and
backward compatibility.

FACTS:

Much Japanese e-mail uses jis-2022 for compatibility (in the old days,
e-mail was only safe for 7-bit data).

As far as data size goes, the compact, popular encodings are EUC (Unix)
and S-JIS (MS systems). These are still used in web pages etc. They are
as small as the UTF-16/UCS-2 used internally for much Unicode data.
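
As a rough check on the size figures, here is a small sketch that computes how many bytes one code point needs in UTF-8 and in UTF-16. The function names are made up for illustration, and the ranges are the standard UTF-8 and UTF-16 boundaries. Kana and the common kanji sit in the Basic Multilingual Plane, so they take 3 bytes in UTF-8 and 2 in UTF-16 (EUC and S-JIS also use 2 bytes for most of them). Only code points above U+FFFF need 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one Unicode code point in UTF-8. */
size_t utf8_encoded_len(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;  /* ASCII */
    if (cp < 0x800)   return 2;  /* Latin supplements, Greek, Cyrillic, ... */
    if (cp < 0x10000) return 3;  /* rest of the BMP, including kana and common kanji */
    return 4;                    /* supplementary planes */
}

/* Bytes needed to encode one Unicode code point in UTF-16. */
size_t utf16_encoded_len(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4; /* one code unit, or a surrogate pair */
}
```

For example, HIRAGANA LETTER A (U+3042) comes out as 3 bytes in UTF-8 and 2 bytes in UTF-16.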

But please note that new Macs and XP/Vista/... use Unicode, and I see
that many files can be in UTF-8. So things are changing.

Osamu

