Character Encoding and Unicode Workflow

Taken from the tip of the week for June 22, 2007, this article discusses character encoding in Lasso and recommends best practices for a complete Unicode workflow.

Introduction

When Lasso processes a page, data is collected from myriad sources, usually as text or string data, and processed using LassoScript tags. The results are then distributed to one or more destinations.

A simple Lasso page load consists of parsing an incoming URL, reading a Lasso page from disk, processing the tags within it, and then transmitting the results to the site visitor's Web browser. Most page loads will also make use of additional sources of data including database searches, URL includes, parsing XML files, even messages downloaded from POP servers. And, many page loads will send data to multiple destinations by adding or updating records in a database or sending email messages in addition to sending a page to the site visitor.

At its heart, Lasso is a Unicode text processing engine. All string operations in Lasso are performed using Unicode routines on UTF-16 data. Lasso must translate all incoming data to this Unicode format so that it can be processed. When data is output from Lasso, it must be translated to the destination's preferred character set.

This tip discusses how data flows through Lasso and where character set conversions are performed. It includes recommendations for best practices to ensure the best data fidelity possible.

Serving a Page

When Lasso serves a Web page it first loads a Lasso page into memory, processes the scripts embedded within the page, and then transmits the HTML reply to the visitor's browser.

Lasso reads pages from disk differently depending on what platform Lasso is running on and whether the page starts with a Unicode Byte Order Mark (BOM). If a page starts with a BOM then Lasso reads it as Unicode data. Lasso can read pages in UTF-8, UTF-16, or UTF-32 and can automatically recognize little-endian or big-endian byte order. If no BOM is detected then Lasso assumes files on Mac OS X are encoded using the Mac-Roman character set and that files on Windows or Linux are encoded using the Latin-1 (ISO-8859-1) character set.
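The detection rule described above can be sketched in Python. This is an illustration of the technique only, not Lasso's implementation; the function name decode_page and the platform label "macosx" are hypothetical.

```python
import codecs

# BOMs are checked with UTF-32 first because the UTF-32 little-endian BOM
# begins with the same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_page(raw: bytes, platform: str) -> str:
    """Decode page bytes using the BOM if present, else a platform default."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Strip the BOM and decode the rest as the matching Unicode form.
            return raw[len(bom):].decode(encoding)
    # No BOM: Mac-Roman on Mac OS X, Latin-1 elsewhere ("macosx" is an
    # assumed label used only for this sketch).
    fallback = "mac_roman" if platform == "macosx" else "iso-8859-1"
    return raw.decode(fallback)
```

Note that a UTF-8 file saved without a BOM would fall through to the platform default in this sketch, which is why the recommendation below is to always include a BOM.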

Best Practice - Encode all Lasso pages in Unicode with an appropriate BOM.

Common Practice - Encode pages in Mac-Roman on Mac OS X and in Latin-1 on Windows or Red Hat Linux.

Avoid - Encoding pages in any other character set or in a Unicode character set without a BOM.

Lasso serves all Web pages in UTF-8 by default. Lasso sets a Content-Type header which informs the browser what character set the data is being transmitted in. All modern Web browsers support UTF-8 and will transmit form data back to Lasso in the same format. Lasso will degrade gracefully to whatever character sets a client supports. For example, if an older client only accepts ISO-8859-1 and specifies so in the HTTP header sent to Lasso, then Lasso will transmit data to the client in that character set.

Lasso's "Default Page Encoding" can be modified in Lasso Site Administration on the Setup > Site > Lasso Settings page. The page encoding for the current page can be modified by changing the [Content_Type] of the page. Lasso will attempt to transmit the data using whatever character set is specified in this header.

[Content_Type: 'text/html; charset=iso-8859-1']

Best Practice - Use Lasso's default UTF-8 encoding for all pages that are served to site visitors.

Common Practice - For compatibility with some older browsers it may be necessary to set Lasso's server-wide "Default Page Encoding" to "iso-8859-1" or another character set.

Avoid - Using different character sets on different pages unless absolutely necessary. Never use META tags to specify what character set is being served through Lasso. Since Lasso includes the character set in the HTTP header it is always unnecessary and inadvisable to include a META tag specifying a character set in your Web pages. At best the character set in the META tag matches the header Lasso is already sending. At worst, and most commonly, the META tag specifies a different character set than Lasso is using and the browser will not know how to interpret the results.
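To illustrate what a character set mismatch looks like, the following Python sketch encodes a string as UTF-8 and then decodes the same bytes as ISO-8859-1, as a client would if it trusted a conflicting declaration. It is a generic demonstration and does not model any particular browser.

```python
text = "émigré"

# Bytes as transmitted when the page is served as UTF-8.
body = text.encode("utf-8")

# The same bytes interpreted as ISO-8859-1: each two-byte UTF-8 sequence
# becomes two separate Latin-1 characters.
misread = body.decode("iso-8859-1")

print(misread)  # Ã©migrÃ©
```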

Form Submissions

Form submissions require a few additional notes. As indicated above, modern browsers transmit form data in the same character set as the page they received. Lasso always assumes that form data is coming in its "Default Page Encoding". This works well until you use a [Content_Type] to manually change a character set or you transmit a form to Lasso from a client which only supports a limited number of character sets.

A hidden input with the name -FormContentType and the name of a character set as its value can be used to tell Lasso which character set to use when reading values from the form. For example, if you have a client which can only transmit ISO-8859-1 characters then you can include a hidden input which tells Lasso this character set is being used. Similarly, if you are using [Content_Type] to set the character set Lasso is using to serve data, you should use a matching -FormContentType parameter to let Lasso know what character set data is coming back in.

Finally, Lasso has a fail-safe mode it will use if it can't decode the characters in a form submission using the UTF-8 default. If a form can't be decoded as UTF-8 then Lasso will attempt to decode it as ISO-8859-1. This allows most older clients to connect to Lasso without any additional steps.
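The decoding order described in this section can be sketched as follows. This is a Python illustration under stated assumptions, not Lasso's code: decode_form_value is a hypothetical name, and its declared parameter stands in for the value of a -FormContentType input.

```python
from typing import Optional


def decode_form_value(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode one submitted form value.

    `declared` plays the role of -FormContentType (an assumption of this
    sketch). Without it, UTF-8 is tried first, then ISO-8859-1.
    """
    if declared is not None:
        return raw.decode(declared)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1")
```

A fallback of this kind is a heuristic: a few ISO-8859-1 byte sequences also happen to be valid UTF-8, so an explicit declaration is more reliable whenever the client's character set is known.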

Best Practice - Use Lasso's default UTF-8 encoding for all pages that are served to site visitors. Forms will be returned to Lasso in UTF-8. Lasso's fail-safe will allow forms from older browsers to work properly as long as they are sent in ISO-8859-1 format.

Common Practice - For compatibility with some older browsers it may be necessary to set Lasso's server-wide "Default Page Encoding" to "iso-8859-1" or another character set.

Avoid - Using different character sets on different pages unless absolutely necessary.

URL Encoding

Incoming URLs are parsed by Lasso using much the same rules as for incoming form data. Lasso assumes that URL data is coming in its "Default Page Encoding". When the page encoding is the default of UTF-8, Lasso recognizes UTF-8 encoding and falls back to ISO-8859-1 in its fail-safe mode if the URL cannot be decoded as UTF-8.

The [Encode_URL] tag will encode URLs using the "Default Page Encoding" or the current [Content_Type] override. You can tell UTF-8 encoding is being used since most characters outside 7-bit ASCII, such as é, will result in two %## escapes in the encoded string. You can force [Encode_URL] to use a different encoding by using the [Bytes] type to convert the string to a specific character set.

[Encode_URL: 'émigré']  %C3%A9migr%C3%A9

[Encode_URL: (bytes: 'émigré', 'iso-8859-1')] %E9migr%E9
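The same two results can be reproduced outside Lasso. The Python sketch below uses the standard library's urllib.parse.quote, which percent-encodes the bytes of a string in a chosen character set; it is shown only to make the byte-level difference visible and is not a substitute for [Encode_URL].

```python
from urllib.parse import quote, unquote

word = "émigré"

# UTF-8: each é becomes two bytes, hence two %## escapes.
utf8_url = quote(word)
print(utf8_url)  # %C3%A9migr%C3%A9

# ISO-8859-1: each é is a single byte, hence one %## escape.
latin1_url = quote(word, encoding="iso-8859-1")
print(latin1_url)  # %E9migr%E9

# Decoding must use the same character set that was used for encoding.
assert unquote(utf8_url, encoding="utf-8") == word
assert unquote(latin1_url, encoding="iso-8859-1") == word
```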

Best Practice - Use Lasso's default UTF-8 encoding for all URLs that will be sent to either the current server or other servers. Lasso will automatically recognize the vast majority of URLs from other servers in either UTF-8 or ISO-8859-1.

Common Practice - For compatibility with some older Web applications it may be necessary to encode URLs which will be sent to that server using ISO-8859-1 or another character set. The [Bytes] tag should be used in concert with the [Encode_URL] tag to create URLs for these applications.

Avoid - Using different character sets when encoding URLs unless absolutely necessary.

MySQL

All of Lasso's communication with MySQL data sources is through strings. Since Lasso processes all string data internally using Unicode it is often necessary to translate character encoding both when transmitting SQL statements to MySQL and when interpreting search results.

MySQL 4.0 and earlier handled character encoding naively. They stored whatever data they were given as if it was ISO-8859-1 data. It was possible to store and retrieve UTF-8 data through Lasso, but when that data was fetched by other clients it could be misinterpreted.

MySQL 4.1 and higher handle character sets natively. Each MySQL table now has a character encoding attached to it. MySQL tables default to ISO-8859-1, but can also be set to hold UTF-8 data or any character set.

There are two ways to specify what encoding Lasso will use when communicating with MySQL:

If the "Use MySQL 4.1/5.x Character Sets" option is set to Yes in the Setup > Data Sources > Hosts section of Lasso Site Administration for a MySQL host then Lasso will transmit all data to MySQL in UTF-8 and allow MySQL to automatically encode that data according to the table definition.

Otherwise, Lasso will use the encoding defined in the Setup > Data Sources > Tables section of Lasso Site Administration for the -Table specified in an [Inline]. If no -Table is specified then ISO-8859-1 will be used. Note that a "Table Batch Change" option in Lasso Site Administration allows the encoding for all tables on a given host or database to be changed at once.
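The two rules above amount to a short decision procedure, sketched here in Python. The function and parameter names are hypothetical, and the behavior for a table with no configured encoding (falling back to ISO-8859-1) is an assumption of this sketch rather than something the article states.

```python
from typing import Mapping, Optional


def mysql_wire_encoding(
    use_native_charsets: bool,
    table: Optional[str],
    table_encodings: Mapping[str, str],
) -> str:
    """Pick the character set used to talk to a MySQL host."""
    if use_native_charsets:
        # "Use MySQL 4.1/5.x Character Sets" = Yes: always send UTF-8 and
        # let MySQL convert according to the table definition.
        return "utf-8"
    if table is None:
        # No -Table given in the inline: ISO-8859-1 is used.
        return "iso-8859-1"
    # Per-table setting; the default for an unlisted table is assumed.
    return table_encodings.get(table, "iso-8859-1")
```

For example, with the option set to No and a table configured as UTF-8, an inline that omits -Table would be sent as ISO-8859-1, which is why the recommendation below is to always specify -Table.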

Best Practice - Set "Use MySQL 4.1/5.x Character Sets" to Yes and allow MySQL to handle the proper encoding for each table.

Common Practice - Ensure that the encoding of each table in MySQL and the corresponding table setting in Lasso are the same. Always specify a -Table parameter in each inline including those which use -SQL statements.

Avoid - Sending data from Lasso in one character set when MySQL is expecting a different character set. Avoid storing UTF-8 data in MySQL 4.0 or earlier since it will make upgrading more difficult.

See the following pages for information about how to upgrade MySQL 4.0 or earlier databases which have data stored in a character set other than ISO-8859-1.

http://dev.mysql.com/doc/refman/4.1/en/charset-upgrading.html

http://dev.mysql.com/doc/refman/4.1/en/charset-conversion.html

JDBC, FileMaker Server Advanced, and SQLite

These data sources always use UTF-8 character encoding so generally nothing special needs to happen when using these data sources through Lasso. However, it is possible for Lasso to assume an ISO-8859-1 character set when new tables are discovered while Lasso is running.

Lasso will automatically set all tables in JDBC, FileMaker Server Advanced, or SQLite to use UTF-8 encoding when a new database is enabled in Lasso Site Administration. Lasso will also automatically correct the encoding of all tables in these data sources when each Lasso site starts up.

However, if a database has already been enabled, Lasso is running, and a new table is added to the database then Lasso will assume ISO-8859-1 on the table until either the database is updated in Lasso Site Administration or the Lasso site is restarted.

Best Practice - Use the default UTF-8 communication for JDBC, FileMaker Server Advanced, and SQLite data sources. Either use Lasso Site Administration to update the settings for a database or restart the Lasso site after new tables are added to a database which is already enabled.

Avoid - Setting the table encoding for tables in these data sources to anything other than UTF-8. The table encoding will only hold until the Lasso site is restarted and will always result in bad data being stored in the database.

FileMaker Pro

FileMaker Pro data sources use the Mac-Roman character set when FileMaker Pro is running on Mac OS X (or earlier) and the ISO-8859-1 character set when FileMaker Pro is running on Windows. When Lasso and FileMaker Pro are running on the same platform no special actions are required since Lasso will automatically use the proper encoding.

If Lasso is running on one platform and FileMaker is running on the other then the "Do ISO/Mac Conversion" setting found in the Setup > Data Sources > Database section of Lasso Site Administration should be set to Yes.
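The kind of conversion this setting implies can be sketched in Python using the standard mac_roman and iso-8859-1 codecs. The function names are hypothetical, and substituting "?" for characters that exist in only one of the two character sets is an assumption of this sketch.

```python
def mac_to_iso(raw: bytes) -> bytes:
    """Re-encode Mac-Roman bytes as ISO-8859-1."""
    return raw.decode("mac_roman").encode("iso-8859-1", errors="replace")


def iso_to_mac(raw: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as Mac-Roman."""
    return raw.decode("iso-8859-1").encode("mac_roman", errors="replace")


# The letter é is byte 0x8E in Mac-Roman but byte 0xE9 in ISO-8859-1.
assert mac_to_iso(b"\x8emigr\x8e") == b"\xe9migr\xe9"
assert iso_to_mac(b"\xe9migr\xe9") == b"\x8emigr\x8e"
```

Without the conversion, the Mac-Roman byte 0x8E read as ISO-8859-1 is a control character rather than é, so accented text is corrupted in both directions.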

Best Practice - Set "Do ISO/Mac Conversion" to Yes if Lasso and FileMaker Pro are running on different platforms or No if Lasso and FileMaker Pro are running on the same platform.

Other Data Sources

All other data sources (Microsoft SQL Server, Oracle, OpenBase, PostgreSQL, ODBC, and others) use the ISO-8859-1 character set by default. Lasso will use the encoding specified in Lasso Site Administration for the -Table specified in the [Inline] tag. Note that a -Table should be specified even for inlines with -SQL statements.

The encoding for an individual table can be set in the Setup > Data Sources > Tables section of Lasso Site Administration. The encoding for all of the tables in a data source host or a given database can be batch changed on the Hosts or Databases sections of Lasso Site Administration.

Best Practice - Check the documentation for your data source and ensure that each table is set to use the proper encoding. Always use a -Table in each inline to ensure that Lasso uses the proper encoding, even if the inline specifies a -SQL statement.

Avoid - Specifying a different character set than the data source can handle natively. This can make it difficult to extract data from the data source using tools other than Lasso since the encoding of the data will not match what is expected.

More Information

More information about all of the tags used in this tip of the week can be found in the Lasso 8.5 Language Guide or in the online Lasso Reference.

Author: Fletcher Sandbeck
Created: 22 Jun 2007
Last Modified: 9 Jun 2011
