Data Mining: Summary of Basic Terms

Preprocessing

Transforms the raw input data into an appropriate format for subsequent analysis.

Data Mining

The process of automatically discovering useful information in large data repositories.

Regression

Predictive model used for continuous target variables.

Association analysis

Used to discover patterns that describe strongly associated features in the data.

Attribute

A property or characteristic of an object that may vary, either from one object to another or from one time to another.

Data cleaning

Detection and correction of data quality problems.

Standardization

The process of making an entire set of values have a particular property.
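One common example of such a property is having mean 0 and standard deviation 1 (z-score standardization). The following is a minimal sketch using only the Python standard library; the function name `standardize` and the sample values are illustrative assumptions, not part of the original notes.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values so they have mean 0 and population standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; standard deviation is zero")
    return [(v - mu) / sigma for v in values]

# Illustrative sample (assumed data): mean is 5, population std is 2.
scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scaled)
```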

Positivity

Distance(x, y) >= 0

Symmetry

Distance(x, y) = Distance(y, x)

Triangle Inequality

Distance(x, z) <= Distance(x, y) + Distance(y, z)
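The three properties (positivity, symmetry, triangle inequality) can be checked numerically on a handful of points. This is a rough sketch, not a proof: the sample points and the helper name `satisfies_metric_properties` are assumptions made for illustration. Squared Euclidean distance is included as a contrast, since it violates the triangle inequality on these points.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def satisfies_metric_properties(dist, pts, tol=1e-12):
    """Check the three properties on every triple drawn from pts."""
    for x, y, z in product(pts, repeat=3):
        if dist(x, y) < 0:                              # positivity
            return False
        if abs(dist(x, y) - dist(y, x)) > tol:          # symmetry
            return False
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:  # triangle inequality
            return False
    return True

points = [(0, 0), (3, 4), (6, 8), (-1, 2)]  # arbitrary sample points (assumed)

euclidean_ok = satisfies_metric_properties(euclidean, points)
squared_ok = satisfies_metric_properties(squared_euclidean, points)
print(euclidean_ok, squared_ok)
```

For the squared distance, the triple (0, 0), (3, 4), (6, 8) gives 100 on the left and 25 + 25 on the right, so the triangle inequality fails.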

Nominal

The attribute type of an ID field. For example, suppose you are analyzing data from a table without IDs and you generate an attribute field with a unique ID for each data object.

[-1, 1]

The range of values that are possible for the cosine similarity measure.
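The two ends of the range can be reproduced with a small example. This is a minimal sketch assuming plain Python lists as vectors; the function name `cosine_similarity` and the sample vectors are illustrative.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])         # close to 1
opposite_direction = cosine_similarity([1, 2, 3], [-1, -2, -3])  # close to -1
orthogonal = cosine_similarity([1, 0], [0, 1])                   # 0
print(same_direction, opposite_direction, orthogonal)
```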

Discrete attribute

Bronze, Silver, and Gold medals as awarded at the Olympics.

Continuous attribute

Angles as measured in degrees between 0 and 360.

Binary attribute

Time in terms of AM or PM.

Euclidean(x, y) = 2

x = <1, 2, 1, 2>, y = <2, 1, 2, 1>
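The value can be verified directly: each of the four coordinates differs by 1, so the sum of squared differences is 4 and its square root is 2. A short check, assuming Python 3.8 or later for `math.dist`:

```python
import math

x = [1, 2, 1, 2]
y = [2, 1, 2, 1]

# Sum of squared coordinate differences: 1 + 1 + 1 + 1 = 4
squared_sum = sum((a - b) ** 2 for a, b in zip(x, y))

# math.dist returns the Euclidean distance between two points
distance = math.dist(x, y)
print(squared_sum, distance)
```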

Given a group of product price records <3, 5, 7, 8, 13, 15, 17, 40, 45, 50, 55, 81, 91, 191, 203, 211, 214, 222>

Equal-frequency partitioning : 3, 5, 7, 8, 13, 15; 17, 40, 45, 50, 55, 81; 91, 191, 203, 211, 214, 222

Equal-width partitioning : 3, 5, 7, 8, 13, 15, 17, 40, 45, 50, 55; 81, 91; 191, 203, 211, 214, 222

Clustering (by largest group gap) partitioning : 3, 5, 7, 8, 13, 15, 17, 40, 45, 50, 55; 81, 91; 191, 203, 211, 214, 222
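The three partitionings of the price records into three bins can be reproduced as follows. This is a sketch under stated assumptions: the function names are made up for illustration, equal-frequency assumes the number of values is divisible by the number of bins, and the gap-based version simply cuts at the largest gaps between consecutive sorted values.

```python
prices = [3, 5, 7, 8, 13, 15, 17, 40, 45, 50, 55, 81,
          91, 191, 203, 211, 214, 222]

def equal_frequency(values, k):
    """Each bin receives the same number of values."""
    values = sorted(values)
    size = len(values) // k  # assumes len(values) is divisible by k
    return [values[i * size:(i + 1) * size] for i in range(k)]

def equal_width(values, k):
    """Each bin covers an interval of the same width, (max - min) / k."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # (222 - 3) / 3 = 73 for the prices above
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        idx = min(int((v - lo) / width), k - 1)  # the maximum goes in the last bin
        bins[idx].append(v)
    return bins

def largest_gap(values, k):
    """Cut the sorted values at the k - 1 largest consecutive gaps."""
    values = sorted(values)
    by_gap = sorted(range(1, len(values)),
                    key=lambda i: values[i] - values[i - 1], reverse=True)
    bounds = [0] + sorted(by_gap[:k - 1]) + [len(values)]
    return [values[a:b] for a, b in zip(bounds, bounds[1:])]

freq_bins = equal_frequency(prices, 3)
width_bins = equal_width(prices, 3)
gap_bins = largest_gap(prices, 3)
print(freq_bins, width_bins, gap_bins, sep="\n")
```

For this data the two largest gaps are 91 to 191 and 55 to 81, which is why the gap-based result coincides with the equal-width result.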

SVD

Minimize squared reconstruction error.

PCA

Capture data variance.

LLE

Preserve local geometry.

Random projections

Preserve pairwise distances.

LDA

Preserve class discrimination.

SVD techniques

Dimensionality reduction, Feature extraction.

Performing SVD

Find the most meaningful basis.

PCA signal of interest

Along the direction with the largest variance.

225, Covariance

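The element c_{1, 2} of the covariance matrix C is 225. What is the value of c_{2, 1}, and what does it represent? Because a covariance matrix is symmetric, c_{2, 1} is also 225, and it is the covariance between attribute 1 and attribute 2.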

Variance along p_1 is bigger than variance along p_2; p_1 is orthogonal to p_2.

If p_1 and p_2 are both principal component vectors, which statements about them are correct?

U_transpose

In the derivation of PCA with SVD we exploit one very important property of orthonormal matrices. If U is an orthonormal matrix, then the inverse of U is equal to U_transpose.
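The property can be checked on a small orthonormal matrix. The sketch below uses a 2-D rotation matrix as an assumed example (the angle and helper names are illustrative) and verifies that U_transpose times U is the identity, which is what makes U_transpose the inverse of U.

```python
import math

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

theta = math.pi / 6  # arbitrary rotation angle (assumed)
U = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

# For an orthonormal U, both products are the identity up to rounding.
left = matmul(transpose(U), U)
right = matmul(U, transpose(U))
print(left, right)
```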

K-means algorithm

It can converge to different final clusterings, depending on initial choice of representatives.
It is widely used in practice.
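The dependence on the initial representatives can be seen on a tiny 1-dimensional example. This is a minimal sketch of the standard assign-then-update loop; the data, the two sets of initial centers, and the function name `kmeans_1d` are assumptions chosen for illustration.

```python
def kmeans_1d(points, initial_centers, max_iter=100):
    """Plain k-means on 1-D points; returns the final clusters."""
    centers = list(initial_centers)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters

points = [1, 2, 3, 11, 12, 13, 21, 22, 23]

spread_init = kmeans_1d(points, [1, 11, 21])  # one center per visible group
bunched_init = kmeans_1d(points, [1, 2, 3])   # all centers in the first group
print(spread_init)
print(bunched_init)
```

With the spread-out initialization the three visible groups are recovered; with all centers starting in the first group, the algorithm converges to a different clustering in which the two right-hand groups are merged.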

K-means clustering

It depends on why you are clustering the data.

It reduces sensitivity to outliers.

Why is the group-average agglomerative clustering algorithm better than the single-link and complete-link agglomerative clustering algorithms?

4 core points

Given the set of 1-dimensional points [1, 2.1, 2.9, 4.5, 5, 5.6, 6.2, 6.5, 7.1, 9.1, 10, 11, 13.5, 17, 19], clustered with DBSCAN using Manhattan distance as the proximity measure, Eps = 1.5, and MinPts = 4 (excluding the point itself), how many of the points are core points?
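The count can be reproduced by brute force. In one dimension, Manhattan distance is just the absolute difference. The sketch below assumes that a neighbor at distance exactly Eps counts as inside the neighborhood (`<=`), which is the reading under which the answer is 4; the function name `core_points` is illustrative.

```python
points = [1, 2.1, 2.9, 4.5, 5, 5.6, 6.2, 6.5, 7.1,
          9.1, 10, 11, 13.5, 17, 19]

def core_points(points, eps, min_pts):
    """Points with at least min_pts other points within distance eps."""
    cores = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if j != i and abs(p - q) <= eps  # the point itself is excluded
        )
        if neighbors >= min_pts:
            cores.append(p)
    return cores

cores = core_points(points, eps=1.5, min_pts=4)
print(cores, len(cores))
```

The core points are 5, 5.6, 6.2, and 6.5. Note that 5 and 6.5 each reach four neighbors only because the other lies at distance exactly 1.5.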

Partitional clustering type

A division of objects into non-overlapping subsets.

Hierarchical clustering type

A set of nested clusters organized as a tree.

Non-exclusive clustering type

In this type of clustering, points may belong to multiple clusters.

Fuzzy clustering type

A point belongs to every cluster with some weight between 0 and 1, and all weights for one point add up to 1.

Heterogeneous clustering type

Clusters of widely different sizes, shapes, and densities.

Connected component clustering type

How a cluster can be represented when the data is represented as a graph.

DBSCAN is most suitable for these data points

Number of data points in an epsilon-neighborhood of p

How the density of a point p is defined in density-based clustering.
