Term
Factors behind the sudden popularity of data mining
|
|
Definition
- Reduction in cost and increased hardware capacity.
- Companies are finding new tools and data as they mine.
- Data can be analyzed from a more complete view.
|
|
|
Term
Examples of applications of data mining |
|
Definition
- Discover new drugs and identify successful therapies
- Reduce fraudulent behavior
- See customer buying patterns
- Reclaim profitable customers
- Better target customers/clients
|
|
|
Term
Definition and characteristics of data mining |
|
Definition
Used to describe knowledge discovery in databases.
- Uses statistical, mathematical, and other techniques to extract and identify useful information
|
|
|
Term
|
Definition
Data mining finds patterns and defines them in terms
of mathematical rules that can be used for prediction or association.
|
|
|
Term
The four broad categories for data mining algorithms:
Prediction
|
|
Definition
Uses the past to tell what will happen in the future |
|
|
Term
The four broad categories for data mining algorithms:
Cluster Analysis
|
|
Definition
Identifies natural groupings of things based on their known characteristics |
|
|
Term
The four broad categories for data mining algorithms:
Association Analysis
|
|
Definition
Finds commonly co-occurring groupings of things |
|
|
Term
The four broad categories for data mining algorithms:
Sequential Relationships
|
|
Definition
Discovers time-ordered events, such as which purchase tends to follow another. (Don't need to know the meaning!) |
|
|
Term
Other data mining procedures include: |
|
Definition
Data visualization and time series forecasting |
|
|
Term
What are the most common of all data mining approaches?
|
|
Definition
Classification procedures |
|
|
Term
What does Classification involve and name a few examples |
|
Definition
Identifies patterns in the data and associates them with observations belonging to a certain category.
Examples include credit approval, store location, target marketing, and fraud detection.
- Most common of all data mining approaches |
|
|
Term
The basic idea of Classification Analysis |
|
Definition
Define the data,
use the data to develop a mathematical model,
then use that model to predict unknown outcomes for future observations |
|
|
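The three steps in the card above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the credit-approval numbers are made up, and the nearest-centroid rule is a simple stand-in chosen for brevity, not one of the techniques named in these notes.

```python
# Step 1: define the data. Hypothetical credit-approval records:
# (income in $1000s, years employed) -> known category.
training = [
    ((20.0, 1.0), "deny"),
    ((25.0, 2.0), "deny"),
    ((60.0, 8.0), "approve"),
    ((75.0, 10.0), "approve"),
]

def fit_centroids(rows):
    """Step 2: develop the model, here one mean point (centroid) per category."""
    sums, counts = {}, {}
    for features, label in rows:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in total)
            for label, total in sums.items()}

def predict(model, features):
    """Step 3: predict the category whose centroid is closest to a new observation."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

model = fit_centroids(training)
print(predict(model, (70.0, 9.0)))   # a new, unlabeled applicant
print(predict(model, (22.0, 1.5)))
```

The point of the sketch is the workflow: the categories are known in the training data, and the fitted model is then applied to observations whose category is unknown.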
Term
Various mathematical techniques are used to develop models for classification. These techniques fall into categories, such as: |
|
Definition
a. Decision tree: if the outcome is categorical and the predictors are either categorical or numeric
b. Linear discriminant analysis (LDA): if the outcome is categorical and the predictors are all numeric, have normal distributions, and have equal variances
c. Logistic regression analysis (LRA): if the outcome is categorical (typically binary) and the predictors are numeric or categorical; it does not require normally distributed predictors or equal variances
|
|
|
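To make the decision tree idea concrete, here is a minimal sketch of a one-split tree (often called a decision stump) for a categorical outcome and a single numeric predictor. The function name `fit_stump` and the income data are hypothetical, and the sketch assumes at least two distinct predictor values; real decision tree software grows many splits and uses more refined split criteria.

```python
def fit_stump(values, labels):
    """Find the threshold on one numeric predictor that misclassifies the fewest rows.

    Returns (threshold, label_if_at_or_below, label_if_above).
    Assumes `values` contains at least two distinct numbers.
    """
    best = None
    ordered = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        # Each side predicts its most frequent category.
        left_label = max(set(left), key=left.count)
        right_label = max(set(right), key=right.count)
        errors = (sum(l != left_label for l in left)
                  + sum(l != right_label for l in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]

# Hypothetical data: income in $1000s and the known credit decision.
incomes = [20, 25, 30, 60, 75, 80]
decisions = ["deny", "deny", "deny", "approve", "approve", "approve"]

threshold, below, above = fit_stump(incomes, decisions)
print(threshold, below, above)
```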
Term
Organizations must use a standardized approach for conducting a data mining project and be able to identify some proposed models.
These models include: |
|
Definition
|
|
Term
The six steps of the CRISP-DM model: |
|
Definition
- Business Understanding: discussing the environment
- Data Understanding: determining the variables to be measured
- Data Preparation: collecting and formatting the data
- Modeling: detecting patterns and relating them to mathematical explanations
- Evaluation: determining the model's effectiveness and that it is a good representation of the material
- Deployment: using the model for business decisions
|
|
|
Term
The five steps of the Six Sigma DMAIC model: |
Definition
Define, Measure, Analyze, Improve, Control |
|
|
Term
The five steps of the SEMMA model: |
Definition
Sample, Explore, Modify, Model, Assess |
|
|
Term
|
Definition
Places observations (rows, customers, students, etc.) into groups so that the members of a group share similar characteristics but the groups themselves are highly different from one another.
Ex: The Sorting Hat in Harry Potter |
|
|
Term
How is cluster analysis different from classification analyses? |
|
Definition
Cluster Analysis: groups are unknown in advance and are created by the analysis
Classification Analysis: groups are distinct and known in advance |
|
|
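The contrast with classification shows up clearly in code: a clustering routine receives no category labels at all. Below is a minimal sketch using a one-dimensional version of k-means, a common clustering technique that these notes do not name. The spending figures and the starting centers are made up for illustration.

```python
def k_means_1d(values, centers, iterations=10):
    """Group unlabeled numbers around the given starting centers."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        # Assign each value to its nearest center.
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical weekly spend per customer; note there are no labels.
spend = [12, 15, 14, 80, 85, 90]
centers, groups = k_means_1d(spend, centers=[0.0, 100.0])
print(groups)
print(centers)
```

The groups come out of the data itself, whereas a classification model would need the group of each training observation to be supplied up front.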
Term
Common application of Market Segmentation |
|
Definition
An analysis that aids in dividing customers into groups based upon data descriptions so that you can individually target those groups
- used to understand the buyer behavior of customers
- used to help retailers in targeting similar groups of customers to determine the appropriate advertising campaign
|
|
|
Term
Examples of Market Segmentation |
|
Definition
Gender, age, income, education level.
(Brands aimed at men or women, music downloads aimed at the young, hearing aids aimed at the old, etc.) |
|
|
Term
|
Definition
Aimed at associations that establish relationships among items within a given record.
(Variables or Columns)
- The goal is to create groups of variables that are similar |
|
|
Term
|
Definition
In the retail business, it refers to research that gives the retailer information to help understand the purchase behavior of a buyer.
Ex: People who buy medicine also buy tissues.
People often go to the store just for milk, so it is placed at the back of the store. |
|
|
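The medicine-and-tissue example above can be checked by counting how often items appear together in the same basket. The sketch below uses made-up baskets and two standard association measures that these notes do not define: support (share of all baskets containing both items) and confidence (share of baskets with the first item that also contain the second).

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one customer's basket.
baskets = [
    {"medicine", "tissue", "milk"},
    {"medicine", "tissue"},
    {"milk", "bread"},
    {"medicine", "tissue", "bread"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # Sort so each pair is always stored in the same order.
    pair_counts.update(combinations(sorted(basket), 2))

def support(a, b):
    """Share of all baskets that contain both a and b."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)

def confidence(a, b):
    """Share of baskets containing a that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]

print(support("medicine", "tissue"))
print(confidence("medicine", "tissue"))
print(confidence("milk", "bread"))
```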
Term
|
Definition
Most data is stored in text documents that lack structure. Text mining is the semiautomatic process of extracting patterns from large amounts of unstructured data.
Aka: text data mining or knowledge discovery in text databases.
- Different from search engines: search engines rely on known relationships, while text mining discovers new patterns. |
|
|
Term
Most popular text mining analyses (4)
|
|
Definition
- Summarization
- Categorizing/Classification
- Clustering
- Concept linking
|
|
|
Term
Common Applications of Text Mining |
|
Definition
Information can be gained by sifting through court orders, medical discharge summaries, quarterly reports, customer comments, etc.
Also emails. |
|
|
Term
|
Definition
Most basic form of text mining (used for summarization).
- The simplest data structure is the feature vector, which is a weighted list of words
- The most important words in the text are listed along with their relative importance
- As a result, the document is reduced to a list of terms and weights. The details of the document may be lost, but the key concepts are identified
|
|
|
Term
Term-Document Matrix (TDM) |
|
Definition
Used for the Categorization/Classification, Clustering, and Concept Linking analyses.
A matrix in which the rows represent the documents, the columns represent the terms, and each cell holds the number of times that term appears in that document. |
|
|
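A small sketch of building a term-document matrix as described in the card above, with documents as rows and terms as columns. The two documents are made up, and splitting on whitespace is a simplifying assumption; real text would need proper tokenization.

```python
from collections import Counter

# Hypothetical corpus: document name -> text.
documents = {
    "doc1": "the customer returned the phone",
    "doc2": "the phone battery failed",
}

# Count how often each term appears in each document.
counts = {name: Counter(text.split()) for name, text in documents.items()}

# Columns: every distinct term in the corpus, in alphabetical order.
terms = sorted(set().union(*counts.values()))

# Rows: one per document; each cell is that term's frequency in that document.
tdm = [[counts[name][term] for term in terms] for name in documents]

print(terms)
for name, row in zip(documents, tdm):
    print(name, row)
```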
Term
|
Definition
Maps unstructured information (in the form of a document of words) into a structured format (in the form of a feature/term vector) or a concept |
|
|
Term
|
Definition
A weighted list of words which defines a concept that describes unstructured information (document of words).
Created by 1: eliminating common words such as articles and conjunctions (the, and, etc.) 2: replacing words with their roots (phones, phoning = phone) 3: making synonyms uniform (pupil = student) 4: calculating the weights of the remaining terms
|
|
|
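The four steps in the card above can be sketched as follows. Everything in the lookup tables is a hypothetical stand-in: a real system would use a full stop-word list, a proper stemmer, and a thesaurus. The weight used here is simply each term's share of the remaining words, which is one possible weighting and an assumption of this sketch.

```python
from collections import Counter

STOP_WORDS = {"the", "and", "a", "is", "was"}      # tiny, assumed list
ROOTS = {"phones": "phone", "phoning": "phone"}    # hypothetical lookup "stemmer"
SYNONYMS = {"pupil": "student"}                    # hypothetical synonym table

def feature_vector(text):
    """Reduce a document to a weighted list of terms."""
    # 1: eliminate common words.
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    # 2: replace words with their roots.
    words = [ROOTS.get(w, w) for w in words]
    # 3: make synonyms uniform.
    words = [SYNONYMS.get(w, w) for w in words]
    # 4: calculate a weight for each remaining term (relative frequency).
    counts = Counter(words)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

vector = feature_vector("The pupil phones and the student is phoning")
print(vector)
```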
Term
|
Definition
Term frequency or "tf"
- Measures the number of times a word appears in a document.
Ex: a larger tf increases the term's weight
(graph in notes) |
|
|
Term
Term-Document Matrix (TDM)
|
|
Definition
A matrix in which the rows represent the documents, the columns represent the terms, and the frequencies represent the number of times a term appears in a particular document
- Used for conducting analyses such as classification analysis/categorization, cluster analysis, and association analysis/concept linking (3 of the 4 popular types of text mining analyses) |
|
|
Term
Text Mining Process
(3 tasks) |
|
Definition
Task 1: Establish the corpus. The purpose is to collect all documents related to a domain of interest for analysis; the documents are then converted to a similar format.
Task 2: Create the TDM
Task 3: Extract the knowledge. This is done in 4 ways (classification analysis, clustering, association analysis, and trend analysis*)
*Trend analysis: analyzes text from various periods of time to see trends or how concepts evolve over time
|
|
|
Term
|
Definition
Marketing and Customer Relationship Management
- Group customers with similar complaints, or group customers by purchasing patterns
Security Application
- ECHELON surveillance is the most prominent text mining application
|
|
|
Term
|
Definition
The Web is the biggest data/text repository and is growing every day.
Web mining (WM) is the discovery of relationships from web data
Ex: hyperlinks to a website from other websites |
|
|
Term
3 different areas of web mining |
|
Definition
- Web Content Mining: extracts and uses the content found within web pages (key concepts).
- Web Structure Mining: extracts useful information from the analysis of links found in web documents. More links suggest deeper coverage of the information.
- Web Usage Mining: extracts and uses information that is generated through web page visits, traffic, transactions, etc. (user history)
|
|
|
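As a rough illustration of web structure mining, the sketch below counts how many links point to each page in a small, made-up link graph. The `.example` page names are hypothetical, and counting inbound links is only the simplest possible structural measure.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "d.example": ["c.example", "b.example"],
}

# Count inbound links: how many times each page is the target of a hyperlink.
inbound = Counter(target for targets in links.values() for target in targets)

# Pages ordered from most to least linked-to.
ranked = inbound.most_common()
print(ranked)
```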