Preprocessing
Transforms the raw input data into an appropriate format for subsequent analysis.
Data Mining
The process of automatically discovering useful information in large data repositories.
Regression
Predictive model used for continuous target variables.
Association analysis
Used to discover patterns that describe strongly associated features in the data.
Attribute
Property or characteristic of an object that may vary, either from one object to another or over time.
Data cleaning
Detection and correction of data quality problems.
Standardization
The process of making an entire set of values have a particular property.
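As one common instance of this idea, z-score standardization rescales a set of values so that it has mean 0 and standard deviation 1. The sketch below is only an illustration under that assumption; the function name `standardize` and the sample values are hypothetical, not from the course material.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and population standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant set of values")
    return [(v - mu) / sigma for v in values]


# Hypothetical sample values, chosen only for illustration.
z = standardize([3, 5, 7, 8, 13])
print(mean(z), pstdev(z))  # approximately 0 and 1
```

The "particular property" here is the pair (mean 0, standard deviation 1); other choices, such as rescaling to a fixed range, follow the same pattern.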
Positivity
Distance(x, y) >= 0
Symmetry
Distance(x, y) = Distance(y, x)
Triangle Inequality
Distance(z, x) <= Distance(y, z) + Distance(x, y)
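The three properties above can be checked numerically on a small set of points. This is a minimal sketch, not a proof: the helper `check_metric_properties` and the sample points are assumptions made for illustration. It also shows that squared Euclidean distance fails the triangle inequality, so it is not a metric.

```python
from itertools import product


def manhattan(x, y):
    """Manhattan (L1) distance between two equal-length tuples."""
    return sum(abs(a - b) for a, b in zip(x, y))


def squared_euclidean(x, y):
    """Squared Euclidean distance; positive and symmetric, but not a metric."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def check_metric_properties(dist, points, tol=1e-12):
    """Return True if positivity, symmetry, and the triangle inequality
    hold for every combination of the given points."""
    for x, y in product(points, repeat=2):
        if dist(x, y) < 0:
            return False  # positivity violated
        if abs(dist(x, y) - dist(y, x)) > tol:
            return False  # symmetry violated
    for x, y, z in product(points, repeat=3):
        if dist(z, x) > dist(y, z) + dist(x, y) + tol:
            return False  # triangle inequality violated
    return True


points = [(0, 0), (1, 2), (3, 1), (-2, 4)]
manhattan_ok = check_metric_properties(manhattan, points)

# On the line 0, 1, 2: squared distance 0->2 is 4, but 0->1 plus 1->2 is only 2.
squared_ok = check_metric_properties(squared_euclidean, [(0,), (1,), (2,)])
print(manhattan_ok, squared_ok)
```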
Nominal
Suppose you are analyzing data from a table without IDs and want to generate an attribute field with a unique ID for each data object. What type of attribute is that field?
[-1, 1]
The range of values that are possible for the cosine similarity measure.
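A short sketch of why the range is [-1, 1]: cosine similarity is the dot product divided by the product of the vector lengths, so it equals the cosine of the angle between the vectors. The function name and the example vectors below are chosen for illustration only.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


same_direction = cosine_similarity([1, 2], [2, 4])        # upper end of the range
opposite_direction = cosine_similarity([1, 2], [-1, -2])  # lower end of the range
orthogonal = cosine_similarity([1, 0], [0, 1])            # middle of the range
print(same_direction, opposite_direction, orthogonal)
```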
Discrete attribute
Bronze, Silver, and Gold medals as awarded at the Olympics.
Continuous attribute
Angles as measured in degrees between 0 and 360.
Binary attribute
Time in terms of AM or PM.
Euclidean(x, y) = 2
x = <1, 2, 1, 2>, y = <2, 1, 2, 1>
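The result can be verified by hand: every coordinate differs by 1, so the sum of squared differences is 4 and its square root is 2. A minimal check, with a helper name chosen for illustration:

```python
import math


def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x = [1, 2, 1, 2]
y = [2, 1, 2, 1]
d = euclidean(x, y)
print(d)  # 2.0
```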
Given a group of product price records <3, 5, 7, 8, 13, 15, 17, 40, 45, 50, 55, 81, 91, 191, 203, 211, 214, 222>
Equal-frequency partitioning : 3, 5, 7, 8, 13, 15; 17, 40, 45, 50, 55, 81; 91, 191, 203, 211, 214, 222
Equal-width partitioning : 3, 5, 7, 8, 13, 15, 17, 40, 45, 50, 55; 81, 91; 191, 203, 211, 214, 222
Clustering (by largest group gap) partitioning : 3, 5, 7, 8, 13, 15, 17, 40, 45, 50, 55; 81, 91; 191, 203, 211, 214, 222
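The three partitionings above can be reproduced with a short sketch. The function names are hypothetical, and two simplifying assumptions are made: the equal-frequency version expects the number of values to be divisible by the number of bins, and the gap-based version simply cuts at the k - 1 largest gaps between neighbouring sorted values.

```python
def equal_frequency(values, k):
    """Split sorted values into k bins holding the same number of values.
    Assumes len(values) is divisible by k."""
    values = sorted(values)
    size = len(values) // k
    return [values[i * size:(i + 1) * size] for i in range(k)]


def equal_width(values, k):
    """Split values into k bins that each cover (max - min) / k of the range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value would land in bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


def largest_gap(values, k):
    """Split sorted values at the k - 1 largest gaps between neighbours."""
    values = sorted(values)
    by_gap = sorted(range(1, len(values)),
                    key=lambda i: values[i] - values[i - 1],
                    reverse=True)
    cuts = sorted(by_gap[:k - 1])
    bins, start = [], 0
    for c in cuts:
        bins.append(values[start:c])
        start = c
    bins.append(values[start:])
    return bins


prices = [3, 5, 7, 8, 13, 15, 17, 40, 45, 50, 55, 81, 91,
          191, 203, 211, 214, 222]
freq_bins = equal_frequency(prices, 3)
width_bins = equal_width(prices, 3)
gap_bins = largest_gap(prices, 3)
print(freq_bins, width_bins, gap_bins, sep="\n")
```

For this data the bin width is (222 - 3) / 3 = 73, and the two largest gaps are 91 to 191 and 55 to 81, which is why the equal-width and gap-based results coincide.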
SVD
Minimize squared reconstruction error.
PCA
Capture data variance.
LLE
Preserve local geometry.
Random projections
Preserve pairwise distances.
LDA
Preserve class discrimination.
SVD techniques
Dimensionality reduction, Feature extraction.
SVD Performing
Find the most meaningful basis.
PCA signal of interest
Along the direction with the largest variance.
225, Covariance
The element c_{1, 2} of the covariance matrix C is 225. What is the value of c_{2, 1}, and what does it represent?
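The answer follows from the symmetry of the covariance matrix: c_{2, 1} is the covariance of the same two attributes as c_{1, 2}, only with the order swapped. The sketch below illustrates this with a hand-rolled sample covariance; the function name and the two attribute columns are made-up values for illustration and do not reproduce the 225 from the question.

```python
from statistics import mean


def covariance_matrix(columns):
    """Sample covariance matrix for a list of equal-length attribute columns."""
    n = len(columns[0])
    means = [mean(col) for col in columns]
    return [
        [
            sum((a - mi) * (b - mj) for a, b in zip(ci, cj)) / (n - 1)
            for cj, mj in zip(columns, means)
        ]
        for ci, mi in zip(columns, means)
    ]


# Hypothetical attributes, e.g. height and weight of four objects.
attr_1 = [150, 160, 170, 180]
attr_2 = [50, 65, 70, 85]
C = covariance_matrix([attr_1, attr_2])

# Off-diagonal entries agree: C[0][1] and C[1][0] are the same covariance.
print(C[0][1], C[1][0])
```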
Variance along p_1 is bigger than variance along p_2, p_1 is orthogonal to p_2
If p_1 and p_2 are both principal component vectors, which statements about them are correct?
U_transpose
In the derivation of PCA with SVD we exploit one very important property of orthonormal matrices. If U is an orthonormal matrix, then the inverse of U is equal to U_transpose.
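The property can be checked on a small example. A 2-D rotation matrix is orthonormal, so multiplying its transpose by the matrix itself should give the identity, which is exactly what "the inverse equals the transpose" means. The helpers and the rotation angle below are assumptions chosen for illustration.

```python
import math


def transpose(m):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


theta = math.pi / 6  # arbitrary rotation angle
U = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

# For an orthonormal U, U_transpose * U is the identity (up to rounding).
identity_check = matmul(transpose(U), U)
print(identity_check)
```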
K-means algorithm
It can converge to different final clusterings, depending on initial choice of representatives.
It is widely used in practice.
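The dependence on the initial representatives can be seen on a tiny 1-dimensional example. This is a minimal sketch of the standard assign-then-update loop, not a production implementation; the function name, the data, and the two initializations are made up for illustration.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain K-means on 1-D points, starting from the given initial centers.
    Returns the clusters from the final assignment step."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return clusters


# Three natural groups, but only K = 2 clusters requested.
points = [1, 2, 3, 11, 12, 13, 21, 22, 23]
result_a = kmeans_1d(points, centers=[1, 2])    # representatives from the low end
result_b = kmeans_1d(points, centers=[22, 23])  # representatives from the high end
print(result_a)
print(result_b)
```

With the first initialization the middle group is merged with the high group, and with the second it is merged with the low group. Both runs converge, but to different final clusterings.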
K-means clustering
It depends on why you are clustering the data.
It reduces the effect of outlier-sensitivity
Why the group average agglomerative clustering algorithm is better than the single-link and complete-link agglomerative clustering algorithms.
4 core points
Given the set of 1-dimensional points [1, 2.1, 2.9, 4.5, 5, 5.6, 6.2, 6.5, 7.1, 9.1, 10, 11, 13.5, 17, 19], clustered with DBSCAN using Manhattan distance as the proximity measure, Eps = 1.5, and MinPts = 4 (excluding the point itself), how many of the points are core points?
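The count can be checked by brute force. In one dimension Manhattan distance is just the absolute difference, so a point is a core point when at least MinPts other points lie within Eps of it. The sketch assumes the neighborhood is inclusive (distance <= Eps); the function name is hypothetical.

```python
def core_points(points, eps, min_pts):
    """Return the points that have at least min_pts other points within eps.
    The point itself is not counted, and the eps boundary is inclusive."""
    cores = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if j != i and abs(p - q) <= eps
        )
        if neighbors >= min_pts:
            cores.append(p)
    return cores


points = [1, 2.1, 2.9, 4.5, 5, 5.6, 6.2, 6.5, 7.1,
          9.1, 10, 11, 13.5, 17, 19]
cores = core_points(points, eps=1.5, min_pts=4)
print(cores)       # [5, 5.6, 6.2, 6.5]
print(len(cores))  # 4
```

Note that 5 only qualifies because 6.5 lies exactly at distance 1.5, so a strict (< Eps) neighborhood would give a different count.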
Partitional clustering type
A division of objects into non-overlapping subsets.
Hierarchical clustering type
A set of nested clusters organized as a tree.
Non-exclusive clustering type
In this type of clustering, points may belong to multiple clusters.
Fuzzy clustering type
A point belongs to every cluster with some weight between 0 and 1, and all weights for one point add up to 1.
Heterogeneous clustering type
Clusters of widely different sizes, shapes, and densities.
Connected component clustering type
If the data is represented as a graph, a cluster can be represented as a connected component of that graph.
DBSCAN is most suitable for these data points
Number of data points in an epsilon-neighborhood of p
How the density of a point p is defined in density-based clustering.