
ESMA 6835 Data Mining

CLASS 2
(based on notes by Prof. V. Kumar)

Dr. Edgar Acuna

Department of Mathematics
University of Puerto Rico, Mayaguez

math.uprrm.edu/~edgar

ESMA 6835 Mineria de Datos Edgar Acuna 1


What is a data set?

A data set is a collection of objects and their attributes.
An attribute is a property or characteristic of an object.
Examples: a person's eye color, weight, annual salary, etc.
An attribute is also known as a variable, characteristic, or feature.
A collection of attributes describes an object.
An object is also called a record, case, sample, entity, or instance.

In the table below, each column is an attribute and each row is an object.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attributes

The values of an attribute are the numbers or symbols assigned to it.
According to the measurement scale, there are four distinct types of
attributes: nominal, ordinal, interval, and ratio.



Properties of attribute values

The type of an attribute depends on which of the following properties its
values possess.
Distinctness: = ≠
Order: < >
Addition: + -
Multiplication: * /
Nominal attribute: distinctness
Ordinal attribute: distinctness and order
Interval attribute: distinctness, order, and addition
Ratio attribute: all four properties
Attribute type: Nominal
Description: The values of these attributes are only distinct names. That is, nominal attributes provide only enough information to distinguish one object from another (=, ≠).
Examples: zip codes, identification numbers, eye color, sex: {male, female}.
Operations: mode, entropy, measures of association, chi-square tests.

Attribute type: Ordinal
Description: The values of these attributes provide enough information to order objects (<, >).
Examples: education level, employment level, grades, street numbers.
Operations: median, percentiles, rank correlation, run tests, sign tests.

Attribute type: Interval
Description: For this type of attribute, the differences between values are meaningful. That is, a unit of measurement exists (+, -).
Examples: temperature in Celsius or Fahrenheit, calendar dates, etc.
Operations: mean, standard deviation, Pearson correlation, t and F tests.

Attribute type: Ratio
Description: Both the differences and the ratios between values are meaningful (*, /).
Examples: monetary quantities, age, weight, length, electric current.
Operations: geometric mean, harmonic mean, percent variation.
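The pairing of attribute types with permitted operations can be illustrated with Python's standard `statistics` module. This is only a minimal sketch: the sample values and variable names are hypothetical and are not taken from the course notes.

```python
import statistics

# Hypothetical sample values, one list per attribute type.
eye_color = ["brown", "blue", "brown", "green"]   # nominal
grade_rank = [1, 3, 2, 3, 3]                      # ordinal (1 = lowest grade)
temp_celsius = [20.0, 22.0, 24.0]                 # interval
income = [100.0, 400.0]                           # ratio

# Mode only needs to tell values apart (=, !=).
nominal_mode = statistics.mode(eye_color)

# Median additionally needs an order (<, >).
ordinal_median = statistics.median(grade_rank)

# Mean additionally needs meaningful differences (+, -).
interval_mean = statistics.mean(temp_celsius)

# Geometric mean additionally needs meaningful ratios (*, /).
ratio_gmean = statistics.geometric_mean(income)
```

Each statistic is also valid for the types listed after it: a median is meaningful for interval and ratio attributes, but a mean is not meaningful for a nominal attribute such as eye color.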
Attribute level: Nominal
Transformation: Any permutation of values.
Comment: If all employee ID numbers were reassigned, would it make any difference?

Attribute level: Ordinal
Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic (strictly increasing) function.
Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Attribute level: Interval
Transformation: new_value = a * old_value + b, where a and b are constants.
Comment: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Attribute level: Ratio
Transformation: new_value = a * old_value
Comment: Like a change of scale, e.g., feet to meters.
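A short sketch may help show what each permitted transformation preserves. The function names and sample values below are hypothetical; the conversion constants (1.8 and 32 for Celsius to Fahrenheit, 0.3048 meters per foot) are the standard ones.

```python
def celsius_to_fahrenheit(c):
    # Interval-level transformation: new_value = a * old_value + b
    return 1.8 * c + 32.0

def feet_to_meters(ft):
    # Ratio-level transformation: new_value = a * old_value
    return 0.3048 * ft

# Ordinal: remapping {1, 2, 3} to {0.5, 1, 10} keeps the ordering of objects.
old = [2, 1, 3]
remap = {1: 0.5, 2: 1, 3: 10}
new = [remap[v] for v in old]
same_order = (sorted(range(len(old)), key=old.__getitem__)
              == sorted(range(len(new)), key=new.__getitem__))

# Interval: the ratio of two temperatures changes with the scale.
c1, c2 = 10.0, 20.0
f1, f2 = celsius_to_fahrenheit(c1), celsius_to_fahrenheit(c2)
ratio_c = c2 / c1
ratio_f = f2 / f1

# Ratio: the ratio of two lengths is the same in feet and in meters.
ft1, ft2 = 3.0, 6.0
ratio_ft = ft2 / ft1
ratio_m = feet_to_meters(ft2) / feet_to_meters(ft1)
```

Here 20 degrees Celsius is twice 10 degrees Celsius numerically, but the same two temperatures in Fahrenheit (68 and 50) are not in a 2:1 ratio, which is why ratios are not meaningful on an interval scale.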
Continuous and discrete attributes
Discrete Attribute
Has only a finite or countably infinite set of values
Examples: number of car sales per day, number of children in a family,
number of occurrences of a given word in a collection of documents.
Often represented as integer variables.
Binary attributes are a special case of discrete attributes.
Example: pass or fail.
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
In practice, real values can only be measured and
represented using a finite number of digits.
Types of data sets

Record
  Data matrices
  Document data
  Transaction data
Graph
  Networks
  Molecular structures
Ordered
  Spatial data
  Temporal data
  Genetic sequence data



Record Data

Data that consists of a collection of records, each of which consists of a
fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
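One possible way to hold record data in code is a list of dictionaries, one per record, all sharing the same keys. This is a sketch only: the key names are hypothetical, and taxable income is stored in thousands (so 125K becomes 125).

```python
# Fixed set of attributes shared by every record (hypothetical key names).
ATTRIBUTES = ("tid", "refund", "marital_status", "taxable_income_k", "cheat")

rows = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Each record maps every attribute name to that record's value.
records = [dict(zip(ATTRIBUTES, row)) for row in rows]

# Example query: the Tid of every record with Cheat = Yes.
cheat_tids = [r["tid"] for r in records if r["cheat"] == "Yes"]
```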



Data Matrix

If data objects have the same fixed set of numeric attributes, then the data
objects can be thought of as points in a multi-dimensional space, where each
dimension represents a distinct attribute.
Such a data set can be represented by an m by n matrix, where there are m
rows, one for each object, and n columns, one for each attribute.

Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
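The two objects in the table can be written as a 2 by 5 matrix and treated as points in a five-dimensional space. The sketch below uses plain nested lists and `math.dist` from the standard library; the variable names are hypothetical. The attributes have different units, so the distance is purely illustrative.

```python
import math

# Rows are objects, columns are attributes:
# projection of x load, projection of y load, distance, load, thickness
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(data_matrix)       # number of objects (rows)
n = len(data_matrix[0])    # number of attributes (columns)

# Euclidean distance between the two objects viewed as points.
distance = math.dist(data_matrix[0], data_matrix[1])
```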



Document Data

Each document becomes a `term' vector,
each term is a component (attribute) of the vector,
the value of each component is the number of times
the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
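A term vector of this kind can be built by counting words. The sketch below is a simplified illustration: the vocabulary lists the ten terms of the table, while the sample text, the function name, and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

# The ten terms used as vector components.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocabulary=VOCABULARY):
    """Return the number of occurrences of each vocabulary term in text."""
    # Naive tokenization: lowercase and split on whitespace.
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]

# Hypothetical document text.
doc = "team play game team play score"
vector = term_vector(doc)
```

Real text would need more careful tokenization (punctuation, word forms such as "plays" versus "play"), which this sketch leaves out.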
Transaction Data

A special type of record data, where each record (transaction) involves a
set of items.
For example, consider a grocery store. The set of products purchased by a
customer during one shopping trip constitutes a transaction, while the
individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
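Because each transaction is a set of items, Python sets are a natural fit. The sketch below stores the five transactions from the table and finds which ones contain a given group of items; the function name is hypothetical.

```python
# TID -> set of items purchased in that transaction.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def containing(itemset):
    """Return the TIDs of all transactions that include every item in itemset."""
    # `itemset <= items` is the subset test for sets.
    return [tid for tid, items in transactions.items() if itemset <= items]

diaper_and_milk = containing({"Diaper", "Milk"})
beer_and_bread = containing({"Beer", "Bread"})
```

Unlike the record data shown earlier, the transactions do not share a fixed number of fields: each one may hold a different number of items.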



Graph Data

Example: generic graph

[Figure: generic graph]



Graph Data

Benzene Molecule: C6H6
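As a rough illustration of a molecule stored as graph data, the sketch below encodes benzene as an adjacency structure: six carbon atoms in a ring, each bonded to one hydrogen atom. The atom labels (C1..C6, H1..H6) are hypothetical names chosen for the example, and bond order is ignored, so every bond is treated as a plain edge.

```python
carbons = [f"C{i}" for i in range(1, 7)]
hydrogens = [f"H{i}" for i in range(1, 7)]

# Edges are bonds between atoms (nodes).
edges = []
for i in range(6):
    edges.append((carbons[i], carbons[(i + 1) % 6]))  # carbon ring bond
    edges.append((carbons[i], hydrogens[i]))          # carbon-hydrogen bond

# Adjacency sets: atom -> set of atoms it is bonded to.
adjacency = {atom: set() for atom in carbons + hydrogens}
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)
```

In this representation each carbon has three neighbors (two carbons and one hydrogen) and each hydrogen has one.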



Ordered Data

Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Bases: A = Adenine, C = Cytosine, G = Guanine, T = Thymine
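A sequence like this can be held as a string, where the position of each base matters. The sketch below takes only the first line of the sequence above and counts its bases with `collections.Counter`; the variable names are hypothetical.

```python
from collections import Counter

# First line of the genomic sequence shown above.
first_line = "GGTTCCGCCTTCAGCCCCGCGCC"

# Number of occurrences of each base (A, C, G, T).
base_counts = Counter(first_line)

# Because the data is ordered, positions are meaningful:
# index of the first adenine (0-based).
first_a = first_line.index("A")
```

Counting alone discards the order. Analyses of ordered data usually rely on position as well, which is what distinguishes it from record data.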



Ordered Data

Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean]
