
LIS512 - PowerPoint Presentation

Uploaded On 2016-05-05


Lecture 3: numbers and characters. Thomas Krichel, 2010-02-17. Structure: numbers, numeric information, character information, the ASCII set, Unicode, encoding, coda, ligatures, collations. ID: 306183




Presentation Transcript

Slide1

LIS512 lecture 3: numbers and characters

Thomas Krichel

2010-02-17

Slide2

structure

numbers

numeric information

character information

the ASCII set

Unicode

encoding

coda

ligatures

collations

transliterations

Slide3

introduction

We have seen that databases store records.

Records contain fields; fields have values.

Here we talk about how, fundamentally, those values are composed.

Numerical values are easy.

String values are harder.

Slide4

literature

The library textbooks are hopelessly short and confused about this topic.

Most of what I have here comes from my own experience.

I recommend Wikipedia; it has fascinating articles about these topics.

Slide5

all gone to a number

In all modern information systems, information is stored to be processed by a computer.

A computer can only deal with numbers.

As a consequence all information has to be converted into a number.

It's a huge job.

Let's start from the ground: numbers.

Slide6

a bit

A bit is the elementary unit of information.

It takes a binary value. We can label it true/false, black/white, +/-, etc.

Every piece of information in all modern information storage systems has to be reduced to a sequence of bits.

We will denote them 0/1 here.

Slide7

byte

A byte is a sequence of 8 bits, from '00000000' to '11111111'. There are 2 to the power 8, meaning 256, possible ways to write a byte.
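The count can be checked by writing out every 8-bit pattern. This is only a sketch in Python; the helper name `bit_patterns` is made up for illustration.

```python
from itertools import product

def bit_patterns(n_bits):
    """Return every string of 0s and 1s of the given length."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]

all_bytes = bit_patterns(8)
print(len(all_bytes))               # 256, i.e. 2 ** 8
print(all_bytes[0], all_bytes[-1])  # 00000000 11111111
```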

If the byte is required to start with 0, then we can only write '00000000' to '01111111'. This leaves us with 2 to the power 7, meaning 128, possibilities.

Slide8

hex numbers

Hex numbers use the usual digits 0 to 9, as well as A to F. A means 10, B means 11, and so on up to F, which means 15.

One hex digit can represent 2 to the power 4, meaning 16, possibilities (0 to 15).
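As a small illustration, Python's built-in `int` can read a hex digit when told to use base 16. The constant name `HEX_DIGITS` is my own.

```python
HEX_DIGITS = "0123456789ABCDEF"

# Each hex digit stands for its position in this string: A is 10, F is 15.
for digit in HEX_DIGITS:
    print(digit, int(digit, 16))

print(len(HEX_DIGITS))  # 16 possibilities for one hex digit
```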

Two hex digits can represent 2 to the power 8, meaning 256, possibilities.

Slide9

bytes and hex numbers

Since two hex digits convey the same number of possibilities as a byte, a byte is often represented as two hex digits.
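The correspondence can be tried out in Python. This is a sketch only; the function names `byte_to_hex` and `hex_to_byte` are hypothetical helpers, built on the standard `int` and `format` functions.

```python
def byte_to_hex(bits):
    """Convert a string of 8 binary digits to two hex digits."""
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 8 binary digits")
    return format(int(bits, 2), "02X")

def hex_to_byte(hex_pair):
    """Convert two hex digits back to a string of 8 binary digits."""
    return format(int(hex_pair, 16), "08b")

print(byte_to_hex("01111111"))  # 7F
print(hex_to_byte("FF"))        # 11111111
```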

Thus, for example,

'00000000' in binary is '00' in hex,

'11111111' in binary is 'FF' in hex,

'01111111' in binary is '7F' in hex.

Slide10

converting information to numbers

Many problems in converting information arise because some part of the information is encoded in one form and some other part in another form.

Example: “15 Juillet 1923” vs “July 17, 1923”

Often such inconsistencies require manual reformatting, which is very expensive.

Slide11

numerical information

Some information can be converted to a number using a simple conversion.

Examples:

A recent point in time is often converted into a number by taking the number of seconds since the first of January 1970.
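A sketch of this first example in Python. It assumes the count starts at midnight UTC, which the slide does not state, and `seconds_since_1970` is a made-up helper name.

```python
from datetime import datetime, timezone

# Assumption: the starting point is midnight UTC on 1 January 1970.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def seconds_since_1970(moment):
    """Return whole seconds between 1 January 1970 (UTC) and `moment`."""
    return int((moment - EPOCH).total_seconds())

lecture_day = datetime(2010, 2, 17, tzinfo=timezone.utc)
print(seconds_since_1970(lecture_day))  # 1266364800
```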

A date is often written as an ISO date in the form yyyymmdd, where yyyy is the year, mm is the month and dd is the day, with leading 0s.

Slide12

numerizing

In the design of every information system, it is a good idea to convert information into something that is directly a number.

There are cases where it is possible to use a number directly, such as

colours

times and dates

locations.

Slide13

another hex number example

Colors on the world wide web follow the red/green/blue color model.

Each color is given as a number #rrggbb, where rr is the amount of red, gg is the amount of green and bb is the amount of blue. All these numbers are hex numbers. Examples:

#FFFFFF white

#00FFFF aqua

Slide14
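The #rrggbb colour notation can be taken apart in a few lines of Python. This is a sketch under the assumption that the input is always the six-digit form; `parse_web_color` is a hypothetical name.

```python
def parse_web_color(code):
    """Split a '#rrggbb' string into (red, green, blue), each 0 to 255."""
    if len(code) != 7 or not code.startswith("#"):
        raise ValueError("expected a colour of the form #rrggbb")
    return tuple(int(code[i:i + 2], 16) for i in (1, 3, 5))

print(parse_web_color("#FFFFFF"))  # (255, 255, 255) white
print(parse_web_color("#00FFFF"))  # (0, 255, 255) aqua
```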

non-numerical information

A lot of information is not numerical by its nature. For example

the name of a person

the title of an expression of a work

The information is of a character string nature.

To store character strings in an information system, each character has to be converted to a number.

Slide15

character

A character is an indivisible unit of textual information.

Textual information is composed of characters, and nothing else.

Slide16

characters and computer

Computers cannot deal with characters directly. They can only deal with numbers.

Therefore we need to associate a number with every character that we want to use in an information encoding system.
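Python makes this association visible through its built-in functions `ord` and `chr`, which use Unicode numbers. A minimal sketch:

```python
# ord() gives the number associated with a character,
# and chr() goes the other way.
for character in "Az ":
    print(repr(character), ord(character))

print(chr(65))  # A
```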

A character set associates characters with numbers.

Slide17

ASCII

ASCII is an old character set developed in the United States. It is a seven-bit character set.

In hex notation, it goes from '00' to '7F'.
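Whether a piece of text stays inside this range is easy to test. The helper `is_ascii` below is my own illustration, not a standard name.

```python
def is_ascii(text):
    """True if every character's number lies between 0x00 and 0x7F."""
    return all(ord(character) <= 0x7F for character in text)

print(is_ascii("LIS512"))  # True
print(is_ascii("Müller"))  # False: ü lies outside the ASCII range
print(0x7F + 1)            # 128 numbers in total
```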

Because of Anglo-Saxon cultural imperialism, the first 128 characters in Unicode are the same as in ASCII.

Slide18

notable characters in ASCII

decimal  hex  byte  Unicode  name
8        8    08    U+0008   backspace
9        9    09    U+0009   horizontal tab
10       A    0A    U+000A   line feed
13       D    0D    U+000D   carriage return
32       20   20    U+0020   space
127      7F   7F    U+007F   delete

Slide19

wikipedia notation

Wikipedia denotes every character in the BMP as U+hhhh where h is a hex digit 0-F.
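The notation is straightforward to produce from a character. A sketch in Python, with `u_notation` as a hypothetical helper name:

```python
def u_notation(character):
    """Format a character's number as U+hhhh with at least four hex digits."""
    return f"U+{ord(character):04X}"

print(u_notation(" "))       # U+0020
print(u_notation("\t"))      # U+0009
print(u_notation("\u2013"))  # U+2013
```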

We will follow this notation here.

Slide20

UCS / Unicode

UCS is a universal character set.

It is maintained by the International Organization for Standardization (ISO).

Unicode is an industry standard for characters. It is better documented than UCS.

For what we discuss here, UCS and Unicode are the same.

Slide21

Basic multilingual plane

This is a name for the first 65536 characters in Unicode.

Each of these characters fits into two bytes and is conveniently represented by four hex digits.
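The size of the plane follows from the four hex digits. The sketch below checks the arithmetic and tests whether a character falls inside the plane; `BMP_SIZE` and `in_bmp` are names I made up.

```python
BMP_SIZE = 16 ** 4  # four hex digits

def in_bmp(character):
    """True if the character's number fits into four hex digits."""
    return ord(character) < BMP_SIZE

print(BMP_SIZE)              # 65536, the same as 2 ** 16
print(in_bmp("ß"))           # True
print(in_bmp("\U0001F600"))  # False: this character lies beyond the BMP
```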

Even for these characters, there are numerous complications.

Slide22

dashes

figure dash ‒ U+2012 to link numbers that do not form a range

en dash – U+2013 to link numbers that form a range

em dash — U+2014 for interjections in a sentence

minus sign − U+2212 for mathematics

Slide23

“smart” quotes

U+201C “ is the opening double quote

U+201D ” is the closing double quote

U+2019 ’ is the apostrophe

The single quote of the ASCII character set is considered to be of mixed usage; it should be avoided when a more specific character can be used.

Similarly, the double quote of the ASCII character set is imprecise.

Slide24

spaces

The non-breaking space, U+00A0, is used when you want to avoid a line break between the two items it separates. For example, in hyperlink text it is good practice to replace spaces with non-breaking spaces, so that a link broken across two lines does not appear to be two links.
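A sketch of that replacement in Python. The function name `unbreakable` and the sample link text are my own choices.

```python
NBSP = "\u00A0"

def unbreakable(link_text):
    """Replace ordinary spaces with non-breaking spaces."""
    return link_text.replace(" ", NBSP)

label = unbreakable("Thomas Krichel")
print(label == "Thomas\u00A0Krichel")  # True
print(" " in label)                    # False
```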

In content where whitespace is collapsed, such as HTML, it can also be used to add extra space.

Slide25

beyond ascii, foreign languages

Everything becomes difficult.

As an example consider the characters

o

ő

ö

The latter two can be considered either as o with diacritics or as separate characters.

Slide26
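Both views of o, ő and ö exist side by side in Unicode. As a sketch, Python's standard `unicodedata` module can split each character into a base letter plus a combining diacritic (the NFD form). The helper `decompose` is a made-up name.

```python
import unicodedata

def decompose(character):
    """List the U+hhhh numbers of a character's decomposed (NFD) form."""
    parts = unicodedata.normalize("NFD", character)
    return [f"U+{ord(part):04X}" for part in parts]

# o, o with double acute, o with diaeresis
for character in "o\u0151\u00F6":
    print(character, decompose(character))
```

The plain o stays a single number, while the other two come out as o followed by a combining mark.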

most problematic: encoding

One issue is how to map characters to numbers.

This is complicated for languages other than English.

But assume UCS/Unicode has solved this.

But this is not the main problem that we have when working

Slide27

encoding

The encoding determines how the numbers of each character should be put into bytes.

If you have a character set that has one byte for each character, you have no encoding issue.

But then you are limited to 256 characters in your character set.

Slide28

fixed-length encoding

If you have a fixed length encoding, all characters take the same number of bytes.

Say, for the basic multilingual plane of Unicode, you need two bytes for each character, and then you are limited to that plane.
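To see the two bytes per character, the sketch below uses Python's `utf-16-be` codec, which yields exactly two bytes for every character of the basic multilingual plane. Choosing that codec as the stand-in for a fixed-length encoding is my assumption.

```python
text = "LIS512"

two_bytes_each = text.encode("utf-16-be")  # two bytes per BMP character
one_byte_each = text.encode("ascii")       # one byte per ASCII character

print(len(text), len(two_bytes_each), len(one_byte_each))  # 6 12 6
print(two_bytes_each.hex(" "))  # 00 4c 00 49 00 53 00 35 00 31 00 32
```

Every second byte is 00 here, since the text contains ASCII characters only.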

If you are writing only ASCII, this appears to be a waste.

Slide29

variable length encoding

The most widely used scheme to encode Unicode is a variable length scheme, called UTF-8.

I will leave out the technical details of how this is done.
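Even without the details, the variable length can be observed directly. In this Python sketch, `utf8_bytes` is a hypothetical helper; the last lines show what happens when UTF-8 bytes are read with a different encoding (Latin-1, picked here only as an example).

```python
def utf8_bytes(character):
    """Return the UTF-8 bytes of one character as spaced hex pairs."""
    return character.encode("utf-8").hex(" ")

# o, o with diaeresis, en dash, and a character beyond the BMP
for character in ["o", "\u00F6", "\u2013", "\U0001F600"]:
    print(f"U+{ord(character):04X}", utf8_bytes(character))

# Reading UTF-8 bytes with the wrong encoding garbles the text.
garbled = "\u00F6".encode("utf-8").decode("latin-1")
print(len(garbled))  # 2 characters appear where there was one
```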

But it is important to understand that the encoding needs to be known and correct.

Slide30

ligature

In fine traditional typography, certain characters appear to be linked to each other.

The most common examples in English usage are fi, ff, fl, ffi, ffl.

Slide31

ligatures growing up

In certain cases, ligatures have become so common that they have become characters of their own.

A prominent example is the German sz ligature, the Eszett (ß). It looks a bit like a beta because it is derived from the Fraktur font of the characters.

Another example, apparently, is &.

Slide32

collations

Collations are a topic related to characters.

A collation is a sorting order of character strings.

You may think this is trivial, just follow the alphabetic order.

But in many languages, diacritics complicate matters.

Slide33

example German

Here are the extra letters of German: Ä/ä, Ö/ö, Ü/ü, ß

In German, there are two collations.

DIN 5007-1, the “dictionary collation”, treats umlauted characters as if they did not have the umlaut, and ß as ss.

DIN 5007-2, the “phonebook collation”, treats an umlauted character as the base letter followed by e (e.g. ä becomes ae), and ß as ss.

Slide34
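The two German collations can be approximated as sort keys in Python. This is a simplified sketch: it ignores the tie-breaking rules of the real DIN 5007 standard, it maps ß to ss in both variants (as far as I know that is what both prescribe), and the names `dictionary_key` and `phonebook_key` are made up.

```python
# Simplified replacement tables; real DIN 5007 has further rules.
DICTIONARY = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def dictionary_key(word):
    """Sort key in the spirit of DIN 5007-1: umlauts are dropped."""
    return word.lower().translate(DICTIONARY)

def phonebook_key(word):
    """Sort key in the spirit of DIN 5007-2: umlauts become letter plus e."""
    return word.lower().translate(PHONEBOOK)

names = ["Muller", "Mueller", "Müller", "Mufti"]
print(sorted(names, key=dictionary_key))
# ['Mueller', 'Mufti', 'Muller', 'Müller']
print(sorted(names, key=phonebook_key))
# ['Mueller', 'Müller', 'Mufti', 'Muller']
```

Under the dictionary key Müller sorts next to Muller; under the phonebook key it sorts next to Mueller.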

transliterations

When non-English characters are supposed to be entered in a system used by English-speaking people, a transliteration might be used.

This can also be the case if the original script may not be commonly understood. An example is Japanese road signs.

Wikipedia lists 20 different ways to do that for Russian, say. The Library of Congress scheme is apparently the most widely used.

Slide35

http://openlib.org/home/krichel

Thank you for your attention!

Please switch off machines b4 leaving!