Analysis and experimental study of various methods through
data mining
Yangpyeonggun
AI Researcher 2025_01 Hong Yong-ho
Summary:
This study presents how data mining techniques can be used to extract
meaningful patterns from large data sets and apply these patterns to solve
real-world problems. Focusing on the main data mining techniques of
classification, clustering, and association rule learning, we analyzed the
latest trends and applications of each technique. Through
experiments, we compare the performance of decision trees, K-nearest
neighbors, Naive Bayes, K-means clustering, and the Apriori
algorithm and discuss the pros and cons of each technique. The study
will present effective applications of data mining, including preprocessing
strategies to improve data quality and increase the accuracy of the analysis.
Keywords:
Data Mining, Classification, Clustering, Association Rule
Learning, Decision Tree, K-Nearest Neighbor, Naive Bayes, K-Means
Clustering, Apriori Algorithm, Data Preprocessing, Big Data Analysis
1. Introduction
Data mining is a method for extracting useful information from large data
sets and is becoming increasingly important in a variety of
industrial fields. In particular, as the amount of data increases
exponentially, it is essential to develop and apply effective
data mining methods.1) This study aims to analyze the latest trends in
data mining methods and discuss their importance and necessity.
1.1 Research Background
Data mining is the process of analyzing large amounts of data to extract
useful patterns and information. Recently, data mining has been used in the
corporate, government, medical, and financial sectors for a variety of
applications, including decision support, predictive analysis, and trend
identification.
1.2 Research Objectives
The purpose of this study is to utilize data mining techniques to extract
significant patterns from a specific data set and analyze how this can be
applied to solve real-world problems.
2. Data Mining Overview
Data mining is the process of automatically extracting useful
patterns, rules, trends, or information from large data sets. The process
leverages a variety of techniques, including statistics, machine learning, and
database systems, and focuses on extracting hidden knowledge and insights
from data. Data mining is widely used by companies and research
institutions to support decision making.
The main data mining techniques include classification, clustering,
association rule mining, and regression analysis.2) These techniques are
used to analyze and predict data according to different goals. In particular,
machine learning algorithms such as random forests can effectively model
complex patterns in data.3)
1) Lipovetsky, S. (2022). Statistical and Machine-Learning Data Mining:
Methods for Better Predictive Modeling and Analysis of Big Data.
Technometrics, 64, 145-148.
2) Oatley, G. (2021). Data mining, big data, and crime analysis. Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 12.
3) Malashin, I. P., Masich, I., Tynchenko, V., Nelyub, V. A., Borodulin, A.,
Gantimurov, A. P., Shkaberina, G., & Rezova, N. (2024). Predicting
Dendrolimus sibiricus outbreaks: predictive modeling based on data analysis
and genetic programming. Forests.
Data mining is used in a variety of fields, including finance, medicine,
marketing, and social media analysis. For example, it is used for disease
prediction and patient management in the medical field,4) in
manufacturing to predict defects and increase the efficiency of
production processes,5) and in education to predict learning
outcomes and provide customized learning experiences.6)
The data mining process is divided into the following stages: data collection,
data preprocessing, model building, evaluation and interpretation. Each
stage is essential for improving data quality and extracting meaningful
insights. Data preprocessing is particularly important and is an essential
step to remove noise from the data and ensure data consistency.
Data mining poses a variety of challenges, including data quality, security
and privacy issues, and complex interpretations. In particular, distributed
processing and real-time analysis of data have emerged as major
technological issues in the big data environment, and recently, active
research has been conducted to solve these problems by utilizing
metaheuristic techniques.7)
Thus, data mining offers innovative solutions in various fields and has
become an indispensable technology in the big data era. Future research is
expected to develop more elaborate and powerful data analysis techniques
by integrating it with artificial intelligence.
2.1 Definition of Data Mining
Data mining refers to the process of finding hidden patterns, relationships,
and rules in large data sets through the use of statistics, machine learning,
and database technologies. This allows a company to find examples of 5 ---
customer behavior6 ---
4) Jayasri, N. P., & Aruna, R. (2021). Big data analysis in healthcare using
data mining and classification techniques. ICT Express, 8, 250-257.
5) Dogan, A., & Birant, D. (2021). Machine learning and data mining in
manufacturing. Expert Systems with Applications, 166, 114060.
6) Fischer, C., Pardos, Z., Baker, R., Williams, J., Smyth, P., Yu, R., Slater, S.,
Baker, R., & Warschauer, M. (2020). Mining big data in education:
Affordances and challenges. Review of Research in Education, 44, 130-160.
7) Moshkov, M., Zielosko, B., & Tetteh, E. T. (2022). Selected data mining
tools for data analysis in distributed environments. Entropy, 24.
anomalous transaction detection, product recommendation, and various
other analyses.
Data mining is the process of discovering useful patterns, relationships,
rules, or trends in large data sets. This process is conducted primarily
through the use of techniques such as statistics, machine learning, pattern
recognition, and database systems, and concentrates on discovering
meaningful information hidden in the data. The ultimate goal of data mining
is to analyze data to gain knowledge and insights useful for decision making.
Data mining processes large volumes of data and enables future forecasting,
customer segmentation, anomaly detection, and pattern discovery through
automated analysis, and is used by companies and research institutions for
decision support, problem solving, and business optimization.
Data mining is the process of extracting useful patterns, trends, and
knowledge from large amounts of data to help solve business and scientific
problems through data analysis and prediction. The process utilizes
techniques from a variety of disciplines, including statistics, machine
learning, and database technologies, to analyze data in a variety of formats
to derive meaningful insights.
The main goal of data mining is to discover information hidden in data and
use it to predict, categorize, cluster, and perform other tasks.8) For
example, in finance and medicine, predictive modeling can predict
customer behavior and disease onset.9) In the education sector, it can be
used to predict learning outcomes and provide customized education.10)
It is also applied to ecosystem data analysis for environmental
monitoring and the implementation of preventive measures.11)
The data mining process typically includes the stages of data collection, data
preprocessing, model building, evaluation, and interpretation. Data
preprocessing is particularly important and is necessary to remove
noise from the data and ensure consistency. After this
preprocessing process, various algorithms are applied to model the data,
and finally the results are interpreted to contribute to substantive decision
making.12)
8) Lipovetsky, S. (2022). Statistical and Machine-Learning Data Mining:
Methods for Better Predictive Modeling and Analysis of Big Data.
Technometrics, 64, 145-148.
9) Jayasri, N. P., & Aruna, R. (2021). Big data analysis in healthcare using
data mining and classification techniques. ICT Express, 8, 250-257.
10) Fischer, C., Pardos, Z., Baker, R., Williams, J., Smyth, P., Yu, R., Slater, S.,
Baker, R., & Warschauer, M. (2020). Mining big data in education:
Affordances and challenges. Review of Research in Education, 44, 130-160.
11) Malashin, I. P., Masich, I., Tynchenko, V., Nelyub, V. A., Borodulin, A.,
Gantimurov, A. P., Shkaberina, G., & Rezova, N. (2024). Predicting
Dendrolimus sibiricus outbreaks: predictive modeling based on data analysis
and genetic programming. Forests.
12) Moshkov, M., Zielosko, B., & Tetteh, E. T. (2022). Selected data mining
tools for data analysis in distributed environments. Entropy, 24.
Recently, the development of data mining has been further accelerated by
integration with big data technologies. To effectively process and analyze
large data sets, data mining tools that can operate in distributed
environments are being developed, which contributes to increasing
the efficiency of data analysis.13) These technological developments
play an important role in making an organization competitive in establishing
and implementing a data infrastructure strategy.
Data mining has become an essential technology in modern society,
supporting data-driven decision making in a variety of industries and academic
fields. Future research is expected to develop more sophisticated data
analysis techniques by integrating machine learning and artificial
intelligence techniques.
2.2 Main data mining methods
Classification: A technique that divides data into predefined categories, such
as decision trees, random forests, support vector machines (SVM), and naïve
Bayes.
Clustering: A technique for grouping similar data points, including K-means
clustering, hierarchical clustering, and DBSCAN.
Regression Analysis: a technique to predict continuous values, including
linear regression, polynomial regression, and logistic regression.
Association Rule Learning: a technique for
finding interesting relationships between data items, represented by the
Apriori algorithm and FP-Growth used in market basket analysis.
Dimensionality Reduction: a technique that reduces the dimensionality of
data to increase processing speed and facilitate visualization; methods
include PCA (Principal Component Analysis), t-SNE, and LDA
(Linear Discriminant Analysis).
Anomaly Detection: a technique that identifies data points that deviate
from the general pattern; outlier detection models and clustering-based
methods are used.
Sequential Pattern Mining: analyzes patterns of events occurring
over time in chronological order.
13) Dhaenens, C., & Jourdan, L. (2022). Metaheuristics for data mining: a
survey of big data and opportunities. Annals of Operations Research, 314,
117-140.
This search technique is used to analyze time-ordered data.
Other methods: text mining, time series analysis, web mining, and various
other specialized data mining methods.
Classification is a technique that predicts which of a set of predefined
classes a new data point belongs to. Typical algorithms include decision
trees, random forests, and support vector machines (SVMs), which are also
used in the medical field for complex data analysis.14)
Clustering groups data points based on similar characteristics; methods
include K-means, hierarchical clustering, and DBSCAN. This technique is
used to discover natural data patterns and can be an effective data analysis
tool even in distributed environments.15)
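To make the K-means idea above concrete, here is a minimal sketch (the data and function names are hypothetical illustrations, not from the study): points are alternately assigned to their nearest centroid, and centroids are recomputed as cluster means.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means on 2-D points: assign to nearest centroid, then update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize centroids from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid (Euclidean distance)
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        for i, members in enumerate(clusters):
            if members:  # recompute centroid as the mean of its members
                centroids[i] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # two natural clusters of three points each
```

Real implementations add convergence checks and multiple restarts; this sketch only shows the assign/update loop.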
Regression analysis is a technique for predicting continuous target
variables. Methods include linear regression, polynomial regression, and
ridge regression, which are useful for analyzing relationships among
variables and building predictive models. These techniques are particularly
useful in areas such as environmental monitoring.16)
Association rule learning is a method for discovering relationships between
items in data and is often used in market basket analysis. Typical algorithms
include Apriori and FP-Growth, which are used for customer behavior
analysis in various industries.
Anomaly detection is a technique that identifies anomalous data deviating
from normal patterns and plays an important role in financial fraud
detection, network security, and the medical field.17)
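As a simple illustration of one statistical approach to the anomaly detection described above (the data and threshold here are hypothetical), a z-score rule flags points that lie far from the mean:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values whose z-score exceeds the threshold (simple statistical anomaly detection)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if stdev and abs(v - mean) / stdev > threshold]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 42.0]  # hypothetical sensor data
print(zscore_outliers(readings, threshold=2.0))  # → [42.0]
```

Production systems typically use more robust detectors (e.g., distance- or density-based methods), since the mean and standard deviation are themselves distorted by the outliers.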
Time series analysis is a method that analyzes changes in data over time
and predicts future values, including ARIMA models and exponential
smoothing methods, which are used in climate data analysis and economic
forecasting.18)
14) Alinejad-Rokny, H., Sadroddiny, E., & Scaria, V. (2018). Machine
learning and data mining techniques for medical complex data analysis.
Neurocomputing, 276, 1.
15) Moshkov, M., Zielosko, B., & Tetteh, E. T. (2022). Selected data mining
tools for data analysis in distributed environments. Entropy, 24.
16) Malashin, I. P., Masich, I., Tynchenko, V., Nelyub, V. A., Borodulin, A.,
Gantimurov, A. P., Shkaberina, G., & Rezova, N. (2024). Predicting
Dendrolimus sibiricus outbreaks: predictive modeling based on data analysis
and genetic programming. Forests.
17) Sharma, M., Chaudhary, V., Sharma, P., & Bhatia, R. S. (2020). Medical
applications for intelligent data analysis. Intelligent Data Analysis.
18) Wu, X., Zhu, X., Wu, G.-Q., & Ding, W. (2014). Data mining with big
data. IEEE Transactions on Knowledge and Data Engineering, 26, 97-107.
These data mining techniques enable a deeper understanding of data and
allow for innovative and effective analysis across a variety of disciplines. In
particular, big data environments are increasing the efficiency of data
mining through metaheuristics and distributed processing.19)
Classification: A technique for classifying data items into predefined
categories (e.g., spam mail classification).
Clustering: a technique to group similar data items (e.g., customer
segmentation).
Regression analysis: a technique for predicting continuous values
(e.g., predicting stock prices)
Association Rule Mining: a technique for finding relationships between
items (e.g., market basket analysis).
3. Research Methods
3.1 Data set selection
Factors to consider when selecting a data set
Purpose and Goal: Clearly define the purpose and goal of data analysis and
modeling. This will help you understand what type of data you need.
Data Availability: It must be ensured that the required data actually exists
and is accessible.
Ensure that data can be accessed through public data sets, internal
databases, APIs, etc.
Data Size and Format: Evaluate whether the size and format of the data set
are suitable for analysis and processing. Sufficient storage and processing
capacity must be available, and the data format should be checked for
analytical compatibility.
Data Quality: Evaluates the accuracy, completeness, and consistency of a
data set. Noisy data or data with many missing values may reduce the
accuracy of the analysis.
Domain suitability: ensure that the data is appropriate for the domain of the
problem you wish to analyze. Domain knowledge is needed
19) Dhaenens, C., & Jourdan, L. (2022). Metaheuristics for data mining: a
survey of big data and opportunities. Annals of Operations Research, 314,
117-140.
to evaluate the meaning and value of the data.
Ethics and Privacy: Ethical considerations regarding data use and data
protection laws must be observed. Appropriate anonymization and security
measures are required when using sensitive data.
Frequency of Updates: If you need the most up-to-date data, make sure your
data set is updated regularly. The up-to-dateness of the data may affect the
results of the analysis.
Define the goals of the project and the questions you want to answer;
this is an important basis for selecting data mining methods and
determining data requirements. Malashin et al.20) provide a case study of
the development of a predictive model based on genetic programming
using climate variables and a forest-attribute dataset to predict the
occurrence of a specific pest.
To find the data sets you need, search a variety of sources, including public
databases, internal corporate data, and web scraping. It is important to
consider the legal and ethical considerations associated with the data
sources. For example, the ONET database can be an important data source
for occupational market analysis21) .
The process involves assessing the quality of the selected data set and
checking for missing values, outliers, data consistency, and accuracy. Data
quality directly affects the reliability of the results; the treatment of missing
values and the selection of features are important for ensuring quality.22)
Considering the size and diversity of the data set, we need to make sure that
we have a large enough sample size. The data must be sufficiently diverse
so that a variety of patterns and insights can be discovered. Peng et al.
studied the impact of data set size on data mining results.23)
20) Malashin, I. P., Masich, I., Tynchenko, V., Nelyub, V. A., Borodulin, A.,
Gantimurov, A. P., Shkaberina, G., & Rezova, N. (2024). Predicting
Dendrolimus sibiricus outbreaks: predictive modeling based on data analysis
and genetic programming. Forests.
21) Karakatsanis, I., AlKhader, W., MacCrory, F., Alibasic, A., Omar, M. A.,
Aung, Z., & Woon, W. (2017). A data mining approach to monitoring job
market requirements: a case study. Information Systems, 65, 1-6.
22) Dzulkalnine, M. F., & Sallehuddin, R. (2019). Missing data imputation
with fuzzy feature selection for diabetes datasets. SN Applied Sciences, 1.
23) Peng, G., Sun, S., Xu, Z., Du, J., Qin, Y., Sharshir, S. W., Kandeal, A. W.,
Kabeel, A. E., & Yang, N. (2025). Effects of dataset size and big data mining
process for investigating solar desalination using machine learning.
International Journal of Heat and Mass Transfer.
Evaluate whether the selected data set can be converted into an analyzable
format through preprocessing. This includes data cleaning, transformation,
and integration, which are critical stages of data analysis.
Consider technical requirements such as dataset format, storage, and
accessibility to ensure compatibility with data mining tools and
environments. Jeong et al. show how training data selection through
dataset distillation can contribute to the rapid deployment of machine
learning workflows.24)
Selecting appropriate data sets through this systematic process
maximizes the effectiveness of data mining and ultimately leads to more
reliable insights and conclusions. Data set selection is the first step in data
analysis and should be approached with care, as it has a significant
impact on all subsequent processes.
This study used [description of the dataset used in the study, e.g., analysis
of specific customer purchase data]. This dataset is based on [dataset
source and description] and contains a total of [n] attributes and [m] records.
3.2 Data Preprocessing
Data preprocessing is the process of preparing data for analysis and modeling.
Data Collection: Collect data from a variety of sources. This can be done
through databases, files, web scraping, etc.
Data purification: processes errors, duplicates, and missing values from the
collected data.
Correct errors: identify and correct data entry errors and incorrect values.
Delete duplicates: search for and delete duplicate data records.
Missing value processing: missing values are processed in various ways,
such as mean replacement, deletion, and predictive value replacement.
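The mean-replacement strategy mentioned above can be sketched as follows (a minimal illustration with hypothetical data, not the study's actual pipeline):

```python
import statistics

def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in column]

ages = [25, None, 31, 28, None, 36]  # hypothetical column with missing values
print(impute_mean(ages))  # → [25, 30, 31, 28, 30, 36]
```

Deletion and predictive-value replacement follow the same pattern: drop the `None` rows, or fit a model on the observed rows to predict the missing ones.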
Data Conversion: Convert data into a format suitable for analysis.
Data type conversion: Converts data types such as numeric and character types
as needed.
24) Jeong, Y., Hwang, M., & Sung, W. (2022). Training data selection based
on dataset distillation for rapid deployment in machine learning workflows.
Multimedia Tools and Applications, 82, 9855-9870.
Scaling: apply normalization or standardization to keep the magnitudes of
features on a consistent scale.
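The two scaling options above can be sketched like this (hypothetical data; a minimal illustration, not the study's code):

```python
import statistics

def min_max(values):
    """Min-max normalization: rescale values into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: zero mean, unit (population) standard deviation."""
    mean, stdev = statistics.mean(values), statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

heights = [150, 160, 170, 180, 190]  # hypothetical feature
print(min_max(heights))      # → [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(heights))  # mean of the result is ≈ 0
```

Min-max scaling is sensitive to outliers (they define `lo` and `hi`), while standardization preserves the shape of the distribution around the mean.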
The following is a list of the most common problems with the "C" in the "C" column.
Encoding: To convert categorical data to numeric types, e.g., label encoding.
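Label encoding, as mentioned above, can be sketched as follows (hypothetical categories for illustration):

```python
def label_encode(column):
    """Label encoding: map each distinct category to an integer index."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(column)))}
    return [mapping[v] for v in column], mapping

colors = ["red", "green", "blue", "green"]  # hypothetical categorical feature
codes, mapping = label_encode(colors)
print(codes)    # → [2, 1, 0, 1]
print(mapping)  # → {'blue': 0, 'green': 1, 'red': 2}
```

Note that label encoding imposes an arbitrary order on the categories; for unordered categories, one-hot encoding is often preferred.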
Data integration: integrate data from multiple sources into one consistent
data set.
Feature selection and extraction: select features useful for the analysis or
derive new ones.
Feature selection: improve model performance by removing features
not needed for the analysis.
Feature extraction: use PCA, LDA, etc. to extract new features or reduce
dimensionality.
Data partitioning: Data is divided into training, validation, and test data to
prepare the model for evaluation of its performance.
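The train/validation/test partitioning described above might look like this (a sketch; the 60/20/20 ratio and the fixed seed are assumptions for illustration):

```python
import random

def split_data(records, train=0.6, val=0.2, seed=42):
    """Shuffle and split records into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = records[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

data = list(range(10))  # hypothetical records
tr, va, te = split_data(data)
print(len(tr), len(va), len(te))  # → 6 2 2
```

Shuffling before splitting avoids ordering bias (e.g., records sorted by date); for classification tasks a stratified split that preserves class proportions is often used instead.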
Data preprocessing is an essential process in data analysis and machine
learning projects, responsible for converting raw data into an analysis-ready
format, enhancing data quality, and improving model performance.
Preprocessing processes include a variety of techniques such as missing
value processing, outlier detection, data transformation (normalization,
standardization, etc.), categorical data encoding, and data reduction. These
processes help ensure data consistency and accuracy and increase the
reliability of analytical results.
Recent studies have presented new trends and methodologies in data
preprocessing. For example, Mishra et al. showed that data quality can be
significantly improved by using a combination of multiple preprocessing
techniques.25) Wang et al. cover the development of data preprocessing for
biomedical data fusion and present various challenges and prospects.26)
This can provide important insights, especially when dealing with complex
data sets.
Preprocessing methodologies for special data sets have also been studied.
For example, Pedroni et al. proposed a standardized preprocessing method
for EEG data,27) and Olisah et al. introduced an integrated approach of data
preprocessing and machine learning for diabetes prediction and
diagnosis.28) These studies
25) Mishra, P., Biancolillo, A., Roger, J., Marini, F., & Rutledge, D. (2020).
New data preprocessing trends based on ensembles of multiple
preprocessing techniques. TrAC Trends in Analytical Chemistry, 132, 116045.
26) Wang, S., Celebi, M. E., Zhang, Y., Yu, X., Lu, S., Yao, X., Zhou, Q.,
Martinez-Garcia, M., Tian, Y., Górriz, J., & Tyukin, I. (2021). Advances in
data preprocessing for biomedical data fusion: An overview of the methods,
challenges, and prospects. Information Fusion, 76, 376-421.
27) Pedroni, A., Bahreini, A., & Langer, N. (2018). Automagic: standardized
preprocessing of big EEG data. NeuroImage, 200, 460-473.
28) Olisah, C. C., Smith, L. N., & Smith, M. L. (2022). Predicting diabetes
and diagnostics from a data preprocessing and machine learning perspective.
Computer Methods and Programs in Biomedicine, 220, 106773.
provide effective ways to preprocess domain-specific data.
Preprocessing can save time and resources and ultimately support better
decision making. Therefore, it is important to develop a preprocessing
strategy that is tailored to the characteristics of the project and
the data. This will optimize the quality of the data and ensure the accuracy of
the analysis.
Before data mining, it is important to process the data because it often
contains missing values, outliers, or duplicates. In this study, the
following preprocessing steps were taken:
Missing value handling: replacement with the mean
Outlier detection and removal
Data standardization and normalization
3.3 Analytical Methods
There are various types of analysis methods, which are selected primarily
based on the characteristics of the data and the purpose of the analysis.
Descriptive statistical analysis: a method for capturing the basic
characteristics of data; the distribution and trends of the data are
understood by calculating the mean, median, standard deviation, and so on.
Regression analysis: used to model and predict the relationship between
two or more variables; includes linear regression, polynomial regression,
and logistic regression.
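As a worked sketch of the simplest case above, ordinary least squares for a single predictor can be computed in closed form (the data here is hypothetical, not the study's dataset):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]  # roughly y = 2x
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # slope ≈ 1.95, intercept ≈ 0.15
```

Polynomial and logistic regression generalize this idea but require matrix methods or iterative optimization rather than a closed-form ratio.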
Classification analysis: a method of classifying data into predefined
categories, including decision trees, random forests, and support vector
machines (SVM).
Cluster analysis: K-means, hierarchical clustering, DBSCAN, etc. are used as
methods to find natural groups or patterns in the data.
Dimension reduction: This method reduces the dimensionality of data to
improve visualization and processing efficiency, and includes principal
component analysis (PCA) and t-SNE.
Time series analysis: analyzes data as it changes over time to determine
trends and seasonality and to make forecasts; ARIMA, SARIMA, LSTM
models, etc. are used.
Association rule learning: a method to discover interesting relationships
between items in a data set; the Apriori algorithm is used primarily for
market basket analysis.
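To illustrate the support counting at the heart of Apriori-style association rule mining (a simplified sketch that enumerates only 1- and 2-itemsets and omits the full candidate-pruning step; the baskets are hypothetical):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Count support for 1- and 2-itemsets and keep those above min_support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    freq = {}
    for size in (1, 2):
        for combo in combinations(items, size):
            # support = fraction of transactions containing every item in combo
            support = sum(set(combo) <= t for t in transactions) / n
            if support >= min_support:
                freq[combo] = support
    return freq

baskets = [  # hypothetical market-basket data
    {"bread", "milk"}, {"bread", "butter"},
    {"bread", "milk", "butter"}, {"milk"},
]
freq = frequent_itemsets(baskets, min_support=0.5)
print(freq)  # frequent 1- and 2-itemsets with support >= 0.5
```

The real Apriori algorithm exploits the fact that any superset of an infrequent itemset is itself infrequent, generating candidates level by level instead of enumerating all combinations.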
Statistical techniques are essential to understanding the distribution and
relationships of data. Typical examples include hypothesis testing,
regression analysis, and analysis of variance (ANOVA); these
techniques are used to understand the basic characteristics of data and to
analyze relationships among variables. These techniques play an
important role in increasing the reliability of the analysis, which must be
tailored to the characteristics and goals of the data.
Machine learning focuses on learning patterns in data to build predictive
models. Various types exist, including supervised learning (e.g., regression,
classification), unsupervised learning (e.g., clustering, dimensionality
reduction), and reinforcement learning. Data preprocessing has a significant
impact on the performance of machine learning algorithms, and recent
research has highlighted the advantage of using a combination of multiple
preprocessing techniques to improve data quality.29)
Data visualization assists in the intuitive understanding of patterns and
relationships through a visual representation of data. Various visual tools
such as histograms, scatter plots, and heat maps are effective in analyzing
data and communicating results.
These visualization techniques help reduce the complexity of the data and
make it easier to understand the results of the analysis.
These analytical methods are used in a complementary manner to increase
the accuracy and insight of data analysis. The choice of each method
depends on the characteristics of the data and the goals of the analysis, and
it is important to optimize the quality of the data during the preprocessing
process.30) The right combination of data preprocessing and analysis
methods supports better decision making and ensures the accuracy of the
analysis.
The following data mining methods were applied in this study:
29) Mishra, P., Biancolillo, A., Roger, J., Marini, F., & Rutledge, D. (2020).
New data preprocessing trends based on ensembles of multiple
preprocessing techniques. TrAC Trends in Analytical Chemistry, 132, 116045.
30) Pedroni, A., Bahreini, A., & Langer, N. (2018). Automagic: standardized
preprocessing of big EEG data. NeuroImage, 200, 460-473.
Classification techniques: Decision Tree, K-Nearest Neighbor (KNN), Naive
Bayes
A decision tree is a supervised learning model used for data
classification and regression. The model consists of a set of rules for
making decisions based on characteristics of the data. A decision tree
consists of a tree structure, where each internal node represents a test for
a characteristic, each branch represents a branching by test result, and
each leaf node represents a final prediction or outcome.
Intuitive ease of understanding: The tree structure is visually intuitive,
making the decision-making process easy to understand.
Unnormalized data processing: can process a variety of data types
without scaling or normalization.
Can be used for a variety of problems: can be used for both classification
and regression, and can model complex data relationships.
Easy interpretation and intuitive understanding of results. Requires few
preprocessing steps and reflects the characteristics of the data well.
Handles non-linear relationships well.
There is a risk of overfitting; to prevent this, pruning techniques are used.
Sensitive to small changes in the data, which may cause instability in the
tree structure. May be inefficient for large data sets.
Decision trees are used in a variety of fields, including medical diagnostics,
financial fraud detection, customer churn prediction, and marketing
strategy development. They can assist in data-driven decision making and
clearly explain relationships within complex data.
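The node-splitting rules described above are typically chosen with an impurity measure. The following sketch picks the threshold that minimizes weighted Gini impurity for a single numeric feature (hypothetical data; a single-split illustration, not a full tree learner):

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Pick the split threshold minimizing weighted Gini impurity of the two branches."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split must put data on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        if score < best[1]:
            best = (t, score)
    return best

values = [1, 2, 3, 10, 11, 12]           # hypothetical numeric feature
labels = ["a", "a", "a", "b", "b", "b"]  # perfectly separable classes
print(best_threshold(values, labels))    # → (3, 0.0)
```

A full decision-tree learner applies this search recursively to each resulting branch, over all features, until a stopping or pruning criterion is met.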
A Decision Tree is a predictive model that is easy to understand and
interpret and is widely used for data classification and regression
problems. The method forms a tree structure based on the characteristics
of the data, divides the data through decision rules at each node, and
finally produces a prediction result at the leaf nodes.
The greatest advantages of the decision tree are its intuitive
interpretability and ease of visualization. It also handles nonlinear
relationships in the data well, and the preprocessing process is relatively
simple, which makes it useful in practice. However, overfitting problems
may occur; to prevent this, pruning and ensemble techniques such as
Random Forest are commonly utilized.
Recent studies have proposed various approaches to improve the
performance of decision trees. For example, research has been conducted
to achieve better predictive performance on complex data sets in
combination with deep learning: Jiang et al. showed effective
performance on complex data sets with deep decision tree transfer
boosting,31) and Sagi and Rokach proposed a method for summarizing
decision forests into interpretable trees to improve explainability.32)
Decision trees have also been applied in various domains, and
optimization methods appropriate to each field have been studied. For
example, Liu et al. applied tree-enhanced gradient boosting to credit
score evaluation and reported improved performance,33) and Marudi et al.
developed a decision tree-based method suitable for ordinal classification
problems.34)
Thus, decision trees expand their applicability in various fields through
continuous research and development, with the potential to provide
customized solutions to specific problems. Such developments complement
the shortcomings of decision trees and further expand their applicability to a
variety of data sets and problem types.
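To make the node-splitting idea concrete, the following sketch shows how a
single decision-tree split can be chosen by minimizing Gini impurity. This is
a minimal illustration with hypothetical one-feature toy data; the function
names and data are ours, not taken from the study or its experiments.

```python
# A minimal sketch of how a decision tree chooses a split: for each
# candidate threshold on a feature, compute the weighted Gini impurity
# of the two resulting child nodes and keep the split that minimizes it.
# The toy data below is hypothetical, purely for illustration.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Find the threshold on a single feature that minimizes
    the weighted Gini impurity of the two child nodes."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical data: values up to 4 are class 0, values from 6 up are class 1.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
t, score = best_split(xs, ys)
print(t, score)  # splitting at 4 separates the classes perfectly (impurity 0)
```

A full tree applies this search recursively to each child node; pruning then
removes splits that do not generalize, which is the overfitting remedy
mentioned above.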
K-Nearest Neighbors (KNN) is a supervised learning algorithm that performs
classification or regression based on the similarity of data points. The
algorithm refers to the K nearest neighbors to determine the class of a new
data point.
Non-parametric model: requires no assumptions about the data distribution.
Simple and easy to understand: the algorithm is intuitive and can be
implemented without complex mathematical models.
Similarity-based: decision making leverages the distance between data points.
Applicable to a variety of problems: can be used for both classification and
regression.
Short training time: there is no explicit learning phase; computation is
required only at prediction time.
High computational cost: a large amount of computation is required when
predicting on new data.
High memory consumption: all training data must be stored.
31) Jiang, S., Mao, H., Ding, Z., & Fu, Y. (2020). Deep Decision Tree Transfer
Boosting. IEEE Transactions on Neural Networks and Learning Systems, 31,
383-395.
32) Sagi, O., & Rokach, L. (2020).Explainable decision forests:
transforming decision forests into interpretable trees. Information
Fusion, 61, 124-138.
33) Liu, W., Fan, H., & Xia, M. (2021). Credit scoring based on tree-enhanced
gradient boosting decision trees. Expert Systems with Applications, 189,
116034.
34) Marudi, M., Ben-Gal, I., & Singer, G. (2022). A Decision Tree-Based Method
for Ordinal Classification Problems. IISE Transactions, 56, 960-974.
Sensitivity to feature scale: because KNN is distance-based, it is sensitive
to the scale of the features, so scaling (normalization or standardization)
may be required.
KNN is used in image classification, recommendation systems,
pattern recognition, etc. They are especially useful when complex
data preprocessing and model design are not required.
Proper selection of K values has an important impact on performance.
Typically, cross-validation is used to find the optimal K.
K-Nearest Neighbors (KNN) is an intuitive and easy-to-implement
classification and regression algorithm that makes predictions based on the K
nearest neighbors of a given data point. The algorithm primarily uses distance
measures, such as Euclidean distance, to evaluate the similarity between data
points and derives predictions by referring to the labels of the K nearest
neighbors.
The greatest advantage of KNN is that it does not require assumptions about
the data distribution and can be easily applied to various data types.
However, it is computationally expensive and suffers from the curse of
dimensionality, i.e., performance degrades as the dimensionality of the data
increases. To address this, researchers apply various dimensionality reduction
techniques (e.g., principal component analysis, PCA) or study ways to select
appropriate K values.
Recent research has proposed various approaches to improve KNN performance.
For example, there are methods that diversify the distance measure or apply
weighted KNN,35) and attempts have been made to combine KNN with ensemble
techniques.36) Efforts are also being made to improve efficiency on large data
sets, including an iterative Spark-based design of the KNN classifier37) and
algorithms for processing big data.38)
KNN is used in a variety of fields, including image recognition,
recommendation systems, and text classification, and is particularly effective
on small data sets. On large data sets, however, its computational efficiency
must be weighed against that of other algorithms.
Such studies play an important role in extending the flexibility and
applicability of the algorithm.
35) Zhang, S., Li, J., & Li, Y. (2021).Reachable distance functions for KNN
classification.IEEE Transactions on Knowledge and Data Engineering, 35,
7382-7396.
36) Zhu, X., Ying, C., Wang, J., Li, J., Lai, X., & Wang, G. (2021). Ensemble
of ML-KNN for classification algorithm recommendation. Knowledge-Based
Systems, 221, 106933.
37) Maillo, J., Ramírez-Gallego, S., Triguero, I., & Herrera, F. (2017). kNN-IS: An
Iterative Spark-based design of the k-Nearest Neighbors classifier for big
data.Knowledge-Based Systems, 117, 3-15.
38) Chatzigeorgakidis, G., Karagiorgou, S., Athanasiou, S., & Skiadopoulos, S.
(2018). FML-kNN: scalable machine learning on big data using k-nearest
neighbor joins. Journal of Big Data, 5.
Comparative studies of KNN variants39) also contribute to the accuracy of
forecasts.
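The neighbor-voting procedure described above can be sketched in a few lines.
This is a minimal illustration with hypothetical 2-D points and class labels
of our own choosing, not the configuration used in the experiments.

```python
# A minimal sketch of KNN classification: there is no training phase,
# just a majority vote among the K training points closest to the query
# (Euclidean distance). The toy points below are hypothetical.
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote of its k nearest neighbors."""
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    top_k = [y for _, y in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Two hypothetical clusters: class "A" near the origin, class "B" near (5, 5).
train_x = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_x, train_y, (0.5, 0.5), k=3))  # → A
print(knn_predict(train_x, train_y, (5.5, 5.5), k=3))  # → B
```

Because every prediction scans all training points, the cost grows with the
data set size, which is exactly the computational drawback noted above; the
scale sensitivity also follows directly from the use of raw distances.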
Naive Bayes is a supervised learning model based on probability theory that
performs classification by calculating the probability that given data belongs
to a particular class. The algorithm rests on the assumption of conditional
independence, in which each feature is assumed to be independent of the
others.
Probability-based model: computes class probabilities using Bayes' theorem.
Conditional independence: simplifies calculations by assuming independence
between features.
Rapid training and prediction: calculations are simple and efficient.
Simple and fast: The simplicity of the calculations allows even
large amounts of data to be processed quickly. Resistant to noise:
Noise in some characteristics does not significantly affect
predictions.
Can be trained with less data: High performance can be achieved with less
training data.
Limitations of the conditional independence assumption: in reality,
correlations between features may exist, and this assumption may degrade
performance.
Continuous data handling: because the basic model deals with discrete data,
continuous features require preprocessing (or a Gaussian variant).
Naive Bayes is often used in text classification, sentiment analysis, document
classification, and similar tasks. It is very useful in text processing and
exhibits fast and stable performance even with many features. Various variants
of Naive Bayes (e.g., Gaussian Naive Bayes, Bernoulli Naive Bayes) are
available and can be selected according to the characteristics of the data.
Naive Bayes is an intuitive and powerful classification algorithm based on
Bayes' theorem that is widely used in a variety of fields, primarily text
classification, medicine, and customer segmentation. The algorithm assumes
that each feature is independent and combines the prior probability of the
class with the conditional probability of the features to make a final
prediction. This "naïve" assumption allows for easy computation and rapid
learning and prediction, even with large amounts of data.
The main advantage of Naive Bayes is its ability to achieve effective
classification performance even with small amounts of data, and it performs
particularly well on high-dimensional data. However, performance can be
compromised if the assumption of independence between features is unrealistic.
To compensate for this, various variant models that take correlations between
features into account have been proposed. For example, Xu40) proposed a
Bayesian naive Bayes classifier for text classification.
39) Uddin, S., Haque, I., Lu, H., Moni, M., & Gide, E.
(2022). Comparative performance analysis of the K-Nearest Neighbour
(KNN) algorithm and its various variants for disease
prediction.Scientific Reports, 12.
40) Xu, S. (2018). Bayesian naive Bayes classifier to text classification.
Journal of Information Science, 44, 48-59.
Chen et al.41) improved performance in traffic risk management by applying an
improved naïve Bayesian classification algorithm.
In particular, Naive Bayes is frequently used in real-time applications and
early prototyping stages thanks to its easy implementation, and various
studies have aimed to improve its performance. Ontivero-Ortega et al.42) used
a fast Gaussian Naive Bayes for searchlight classification analysis, and Gan
et al.43) improved its performance for text classification.
Despite its simplicity and efficiency, Naive Bayes has established itself as an
effective model in a variety of fields and, through continued research and
development, has the potential to be applied to a wider variety of problems.
These developments have helped to complement the shortcomings of Naïve
Bayes and expand its applicability to more complex problems.
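The mechanics described above (class priors combined with per-feature
conditional probabilities under the independence assumption) can be sketched
for text classification as follows. The tiny corpus and labels are
hypothetical, and Laplace (add-one) smoothing is used so that unseen word
counts do not zero out a class; log probabilities are summed instead of
multiplying raw probabilities, a standard trick to avoid underflow.

```python
# A minimal sketch of multinomial Naive Bayes for text: estimate per-class
# word probabilities with Laplace smoothing, then score a new document by
# log prior + sum of log likelihoods (the conditional-independence step).
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate log priors and smoothed log word likelihoods per class."""
    classes = set(labels)
    priors = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    word_counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc.split())
    vocab = {w for doc in docs for w in doc.split()}
    likelihoods = {}
    for c in classes:
        total = sum(word_counts[c].values())
        likelihoods[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return priors, likelihoods

def predict_nb(priors, likelihoods, doc):
    """Pick the class maximizing log prior + sum of log likelihoods
    (words outside the training vocabulary are simply ignored)."""
    scores = {
        c: priors[c] + sum(likelihoods[c].get(w, 0.0) for w in doc.split())
        for c in priors
    }
    return max(scores, key=scores.get)

# Hypothetical four-document sentiment corpus.
docs = ["great fun great", "boring plot", "fun and great", "boring boring slow"]
labels = ["pos", "neg", "pos", "neg"]
priors, likelihoods = train_nb(docs, labels)
print(predict_nb(priors, likelihoods, "great fun"))  # → pos
```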
Clustering Technique: K-means Clustering
K-means Clustering is an unsupervised learning algorithm that partitions the
data into K clusters, each represented by a center point (centroid). The
algorithm assigns each data point to the nearest centroid to form clusters.
Unsupervised learning: clusters unlabeled data.
Distance-based: calculates the distance between cluster centroids and data
points using, for example, Euclidean distance.
Iterative process: repeats initial centroid setting, assignment, and update.
1. Initialization: K centroids are set arbitrarily.
2. Assignment: each data point is assigned to the nearest centroid to form a
cluster.
3. Centroid update: the centroid of each cluster is recalculated and updated.
4. Repeat: steps 2 and 3 are repeated until the centroids no longer change or
a preset number of iterations is reached.
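The iterative procedure above can be sketched in a few lines. This is a
minimal illustration with hypothetical 2-D points and a naive choice of
initial centroids (the first K points), not the implementation used in the
experiments.

```python
# A minimal sketch of K-means: arbitrary initial centroids, assignment of
# each point to its nearest centroid, centroid update as the cluster mean,
# and repetition until the centroids stop moving. Points are hypothetical.
import math

def kmeans(points, k, max_iter=100):
    centroids = points[:k]  # step 1: here, simply the first k points
    clusters = []
    for _ in range(max_iter):
        # step 2: assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[j].append(p)
        # step 3: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        # step 4: stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated hypothetical groups, near (0, 0) and near (9, 9).
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

The dependence on the initial centroids discussed below is visible here: a
different initialization can converge to different clusters, which is why
methods such as k-means++ initialization are used in practice.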
41) Chen, H., Hu, S., Hua, R., & Zhao, X. (2021). An improved naïve Bayesian
classification algorithm for traffic risk management. EURASIP Journal on
Advances in Signal Processing, 2021.
42) Ontivero-Ortega, M., Lage-Castellanos, A., Valente, G., Goebel, R., &
Valdés-Sosa, M. (2017). Fast Gaussian Naive Bayes for searchlight
classification analysis. Neuroimage, 163, 471-479.
43) Gan, S., Shao, S., Chen, L., Yu, L., & Jiang, L. (2021). Adapting Hidden
Naive Bayes to text classification. Mathematics.
Simple and fast: easy to implement and computationally efficient.
Scalable: can be applied to large amounts of data.
Easy to interpret: results are intuitive and easy to interpret.
Sensitive to initial values: results may differ significantly depending on the
initial centroid setting.
Requires pre-determination of the number of clusters (K): the K value must be
determined in advance, and an incorrect setting may result in inappropriate
clusters.
Suited to spherical clusters: more effective when the cluster shape is
spherical.
K-means clustering is used for customer segmentation, image compression, data
preprocessing, and a variety of other purposes. Techniques such as the elbow
method are often used to determine the K value.
Because K-means is easy to implement and computes quickly, it can be used
effectively even on large data sets. However, the results may differ depending
on the initial centroid setting and may converge to a local minimum.44)
Determining the optimal number of clusters K is important. Methods such as the
elbow method and silhouette analysis are widely used, and these can help
assess the quality of the clustering results.45)
K-means is suited to spherical clusters and may perform poorly on
nonspherical data. Various variant algorithms have been proposed to address
this.46)
Parallel and distributed processing techniques have been developed for
applying K-means in big data environments (see Figure 1). Such approaches
reduce data processing time and optimize memory usage.47)
Various methods have been studied to resolve the randomness of the initial
centroid setting and to increase convergence speed. Examples include improved
K-means initialization methods and acceleration methods that utilize geometric
concepts.48)
44) Sinaga, K. P., & Yang, M. (2020).Unsupervised K-Means Clustering
Algorithm.IEEE Access, 8, 80716-80727.
45) Yu, H., Wen, G., Gan, J., Zheng, W., & Lei, C. (2020).Self-paced Learning
for K-means Clustering Algorithm.Pattern Recognition Letters, 132, 69-75 .
46) He, H., He, Y., Wang, F., & Zhu, W.
(2022). An improved K-means algorithm for clustering aspheric
data. Expert Systems, 39.
47) Mussabayev, R., Mladenović, N., Jarboui, B., & Mussabayev, R. (2022). How
to Use K-means for Big Data Clustering? Pattern Recognition, 137, 109269.
48) Ismkhan, H., & Izadi, M. (2022). K-means-G*: Speeding up k-means
clustering algorithms using primitive geometric concepts. Information
Sciences, 618, 298-316.
K-means clustering is widely used in various fields because of its simplicity
and versatility, and its limitations are being overcome through continuous
research and refinement. These studies improve K-means performance and
contribute to better adaptability to more complex data structures.
Association Rule Analysis: Apriori Algorithm
The Apriori algorithm finds frequent itemsets in a database and generates
association rules from them. It is mainly used in data mining tasks such as
market basket analysis.
Finding frequent itemsets: identifies itemsets that occur frequently in the
data.
Association rule generation: derives rules that indicate relationships between
items based on the frequent itemsets.
Iterative process: finds frequent itemsets while exploring progressively
larger itemsets.
Initialization: calculate the frequency of each item and retain those that
meet or exceed the minimum support.
Frequent itemset generation: starting from itemsets of size 1, the size of the
candidate itemsets is gradually increased based on the frequent itemsets found
so far.
Confidence calculation: association rules are generated from each frequent
itemset, and the rules that satisfy the minimum confidence level are selected.
Market basket analysis: identifies products that customers purchase together
and uses this information to develop marketing strategies.
Recommendation systems: provide product recommendations based on user
behavior.
Fraud detection: identifies unusual patterns in transaction data.
The Apriori algorithm works well with large databases, but it can incur high
computational costs because many candidate combinations of items must be
evaluated. Alternatives such as the FP-Growth algorithm exist to improve on
this.
Support is a measure of how often a particular itemset appears in the overall
transaction data. It is used as a criterion to determine the significance of
an association rule, and the user sets the minimum support according to the
purpose of the analysis.
Confidence is defined as the conditional probability between two items: it
gives the probability that, when one item is purchased, the other item is also
purchased. It is used to evaluate the strength of the association rule.
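These two measures can be computed directly from transaction data. The
following worked example uses five hypothetical transactions of our own
invention; the item names are purely illustrative.

```python
# A small hypothetical example of the support and confidence measures:
# support = fraction of transactions containing the itemset,
# confidence(A -> B) = support(A ∪ B) / support(A).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {bread} -> {milk}:
sup_bread = support({"bread"})         # 4 of 5 transactions
sup_both = support({"bread", "milk"})  # 3 of 5 transactions
confidence = sup_both / sup_bread      # ≈ 0.75
print(sup_bread, sup_both, confidence)
```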
The Apriori algorithm starts with 1-itemsets and iteratively derives
k-itemsets by forming candidate itemsets and filtering them by support. This
process is repeated until the largest itemsets that meet the given minimum
support are found.
Apriori optimizes memory usage by pruning itemsets that do not occur
frequently before extending them. This design keeps processing efficient even
as data sets grow in size.
When the data set is large, or when the minimum support is set low, the
computational complexity can still increase significantly and performance may
degrade. To address this problem, various variant algorithms have been
developed; for example, research is being conducted to improve performance by
utilizing parallel and distributed processing techniques.49)
The Apriori algorithm is used in a variety of fields, including market basket
analysis, recommendation systems, and failure cause analysis, and plays an
important role in extracting useful patterns from data.50) Recent research has
proposed the EAFIM (Efficient Apriori-based Frequent Itemset Mining)
algorithm, which leverages the Spark platform to increase the efficiency of
the Apriori algorithm, enabling more effective pattern analysis of large
transaction data.51) These improvements expand the utility of the Apriori
algorithm and increase its applicability in a variety of industries.
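The level-wise generate-and-prune procedure described above can be sketched as
follows. This is a minimal illustration on hypothetical toy transactions, not
an optimized implementation such as EAFIM; only frequent-itemset mining is
shown, with rule generation left out for brevity.

```python
# A minimal sketch of Apriori: start from frequent 1-itemsets, repeatedly
# extend frequent k-itemsets by one item into (k+1)-candidates, and prune
# any candidate below the minimum support. Transactions are hypothetical.

def apriori(transactions, min_support):
    """Return all itemsets (as sorted tuples) meeting `min_support`."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(set(itemset) <= t for t in transactions) / n

    frequent = []
    # frequent 1-itemsets
    current = [(i,) for i in items if support((i,)) >= min_support]
    while current:
        frequent.extend(current)
        # candidate (k+1)-itemsets built by joining frequent k-itemsets,
        # then filtered by minimum support (the Apriori pruning step)
        candidates = sorted({
            tuple(sorted(set(a) | set(b)))
            for a in current for b in current
            if len(set(a) | set(b)) == len(a) + 1
        })
        current = [c for c in candidates if support(c) >= min_support]
    return frequent

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
print(apriori(transactions, min_support=0.6))
```

With a minimum support of 0.6, "butter" (support 0.4) is pruned at the first
level, so no candidate containing it is ever generated: this early pruning is
the memory and efficiency optimization discussed above.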
4. Experiments and Results
4.1 Experimental Setup
In the experiment, the dataset was split into training and test data. Each
method was compared under the same conditions, and the performance of the
models was evaluated in terms of accuracy (Accuracy), precision (Precision),
recall (Recall), and F1 score.
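The four metrics named above follow directly from the confusion-matrix counts
(true positives, false positives, false negatives). The sketch below computes
them for a hypothetical binary prediction; the labels are invented for
illustration and are not the study's experimental results.

```python
# A minimal sketch of accuracy, precision, recall, and F1 for binary
# classification, computed from hypothetical true/predicted labels.
def evaluate(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))       # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
acc, prec, rec, f1 = evaluate(y_true, y_pred)
print(acc, prec, rec, f1)
```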
4.2 Results
49) Kadry, S. S. (2021). An Efficient Apriori Algorithm for Frequent Pattern
Mining Using MapReduce in Healthcare Data. Bulletin of IEICE.
50) Chen, H., Yang, H., Yang, M., & Tang, X. (2024). Associative rule mining
of aircraft event causes based on the Apriori algorithm. Scientific Reports,
14.
51) Raj, S., Ramesh, D., Sreenu, M., & Sethi, K. (2020). EAFIM: An efficient
Apriori-based frequent itemset mining algorithm on Spark for big transaction
data. Knowledge and Information Systems, 62, 3565-3583.
Classification techniques: the decision tree recorded [performance, including
accuracy/precision/recall], the KNN technique showed [result], and Naive Bayes
showed [performance].
Clustering technique: K-means clustering produced [clustering result]. An
analysis of the distribution of the clusters and the characteristics of each
cluster allowed us to define [customer type].
Association rule analysis: using the Apriori algorithm, we were able to derive
[example association rules]. For example, we found a rule such as "If customer
A buys product X, there is an 80% probability that he will also buy product
Y."
5. Discussion
5.1 Comparison of Techniques
The classification, clustering, and association rule methods used in this
study are useful for solving different types of problems. For example,
classification methods are suitable for clear category prediction, clustering
methods are useful for analyzing customer types, and association rule methods
are effective for developing marketing strategies.
5.2 Limitations of the Study
Some of the methods in this study may not achieve optimal performance due to
limitations in data set size, the specific variables used, and so on. In
addition, performance may differ when the methods are applied in a real-world
environment, because the data may change.
6. Conclusion
This research utilized data mining techniques to analyze a variety of data and
extract meaningful patterns. We were able to identify the strengths,
weaknesses, and applicability of each technique and gain insight into how they
can be used to solve real-world problems. Future research should explore
larger data sets and different algorithms to improve performance and apply
them to a variety of real-world cases.
References
Alinejad-Rokny, H., Sadroddiny, E., & Scaria, V. (2018). Machine learning and
data mining techniques for medical complex data analysis. Neurocomputing,
276, 1.
Alguliyev, R., Aliguliyev, R., & Sukhostat, L.
(2021). Parallel batch k-means for big data clustering.Computers and
Industrial Engineering, 152, 107023.
Chen, H., Hu, S., Hua, R., & Zhao, X. (2021). An improved naïve Bayesian
classification algorithm for traffic risk management. EURASIP Journal on
Advances in Signal Processing, 2021.
Chen, H., Yang, H., Yang, M., & Tang, X. (2024). Associative rule mining of
aircraft event causes based on the Apriori algorithm. Scientific Reports, 14.
Chatzigeorgakidis, G., Karagiorgou, S., Athanasiou, S., & Skiadopoulos, S.
(2018). FML-kNN: scalable machine learning on big data using k-nearest
neighbor joins. Journal of Big Data, 5.
Deng, Z., Zhu, X., Cheng, D., Zong, M., & Zhang, S.
(2016). An Efficient kNN Classification Algorithm for Big Data.
Neurocomputing, 195, 143-148.
Dhaenens, C., & Jourdan, L. (2022). Metaheuristics for data mining: a survey
of big data and opportunities. Annals of Operations Research, 314, 117-140.
Dogan, A., & Birant, D. (2021). Machine learning and data mining in
manufacturing. Expert Systems with Applications, 166, 114060.
Dzulkalnine, M. F., & Sallehuddin, R. (2019). Missing data imputation via
fuzzy feature selection for diabetes datasets. SN Applied Sciences, 1.
Fischer, C., Pardos, Z., Baker, R., Williams, J., Smyth, P., Yu, R., Slater,
S., Baker, R. B., & Warschauer, M. (2020). Mining big data in education:
affordances and challenges. Review of Research in Education, 44, 130-160.
Gan, S., Shao, S., Chen, L., Yu, L., & Jiang, L. (2021). Adapting Hidden Naive
Bayes to Text Classification. Mathematics.
He, H., He, Y., Wang, F., & Zhu, W.
(2022). An improved K-means algorithm for clustering nonspherical data.
Expert Systems, 39.
Jayasri, N. P., & Aruna, R. (2021). Big data analysis in healthcare using data
mining and classification techniques. ICT Express, 8, 250-257.
Jeong, Y., Hwang, M., & Sung, W. (2022). Training data selection based on
dataset distillation for rapid deployment in machine learning workflows.
Multimedia Tools and Applications, 82, 9855-9870.
Jiang, S., Mao, H., Ding, Z., & Fu, Y. (2020).Deep Decision Tree Transfer
Boosting.IEEE Transactions on Neural Networks and Learning Systems, 31,
383-395.
Kadry, S. S. (2021). An Efficient Apriori Algorithm for Frequent Pattern
Mining Using MapReduce in Healthcare Data. Bulletin of the Institute of
Electronics, Information and Communication Engineers.
Karakatsanis, I., AlKhader, W., MacCrory, F., Alibasic, A., Omar, M. A., Aung,
Z., & Woon, W. (2017). A data mining approach to monitoring job market
requirements: a case study. Information Systems, 65, 1-6.
Liu, W., Fan, H., & Xia, M. (2021). Credit scoring based on tree-enhanced
gradient boosting decision trees. Expert Systems with Applications, 189,
116034.
Lipovetsky, S. (2022).Statistical and Machine-Learning Data Mining:
methods for better predictive modeling and analysis of big data.
Technometrics, 64, 145-148.
Maillo, J., Ramírez-Gallego, S., Triguero, I., & Herrera, F. (2017). kNN-IS:
An Iterative Spark-based design of the k-Nearest Neighbors classifier for
big data.Knowledge-Based Systems, 117, 3-15.
Malashin, I. P., Masich, I., Tynchenko, V., Nelyub, V. A., Borodulin, A.,
Gantimurov, A. P., Shkaberina, G., & Rezova, N. (2024). Prediction of
Dendrolimus sibiricus occurrence: predictive modeling based on data analysis
and genetic programming. Forests.
Mao, Y., Gan, D., Mwakapesa, D. S., Nanehkaran, Y. A., Tao, T., & Huang, X.
(2021). A MapReduce-based K-means clustering algorithm. Journal of
Supercomputing, 78, 5181-5202.
Metz, M., Lesnoff, M., Abdelghafour, F., Akbarinia, R., Masseglia, F., &
Roger, J. (2020). "Big data" algorithms for KNN-PLS. Chemometrics and
Intelligent Laboratory Systems.
Mishra, P., Biancolillo, A., Roger, J., Marini, F., & Rutledge, D. (2020). New
data preprocessing trends based on ensembles of multiple preprocessing
techniques. TrAC - Trends in Analytical Chemistry, 132, 116045.
Moshkov, M., Zielosko, B., & Tetteh, E. T. (2022). Selected data mining tools
for data analysis in distributed environments. Entropy, 24.
Mussabayev, R., Mladenović, N., Jarboui, B., & Mussabayev, R. (2022). How to
Use K-means for Big Data Clustering? Pattern Recognition, 137, 109269.
Olisah, C. C., Smith, L. N., & Smith, M. L. (2022). Diabetes prediction and
diagnosis from a data preprocessing and machine learning perspective.
Computer Methods and Programs in Biomedicine, 220, 106773.
Oatley, G. (2021). Data Mining, Big Data, and Crime Analysis. Wiley
Interdisciplinary Reviews: data mining and knowledge discovery, 12.
Ontivero-Ortega, M., Lage-Castellanos, A., Valente, G., Goebel, R., &
Valdés-Sosa, M. (2017). Fast Gaussian Naive Bayes for searchlight
classification analysis. Neuroimage, 163, 471-479.
Pedroni, A., Bahreini, A., & Langer, N. (2018). Automagic: standardized
preprocessing of EEG big data. Neuroimage, 200, 460-473.
Peng, F., Sun, Y., Chen, Z., & Gao, J. (2023). An Improved Apriori Algorithm
for Association Rule Mining in Employability Analysis. Tehnicki Vjesnik -
Technical Gazette.
Peng, G., Sun, S., Xu, Z., Du, J., Qin, Y., Sharshir, S., Kandeal, A. W.,
Kabeel, A., & Yang, N. (2025). Influence of Dataset Size and Big Data Mining
Process in Solar Desalination Studies Using Machine Learning. International
Journal of Heat and Mass Transfer.
Raj, S., Ramesh, D., Sreenu, M., & Sethi, K. (2020). EAFIM: An efficient
Apriori-based frequent itemset mining algorithm on Spark for big transaction
data. Knowledge and Information Systems, 62, 3565-3583.
Ratner, B. (2021).Statistical and Machine-Learning Data
Mining: techniques for better predictive modeling and analysis of big data.
Technometrics, 63, 280-280.
Sagi, O., & Rokach, L. (2020).Explainable decision forests: transforming
decision forests into interpretable trees. Information Fusion, 61, 124-138.
Sharma, M., Chaudhary, V., Sharma, P., & Bhatia, R. S. (2020). Medical
Applications for Intelligent Data Analysis. Intelligent Data Analysis.
Sinaga, K. P., & Yang, M. (2020). Unsupervised K-Means Clustering Algorithm.
IEEE Access, 8, 80716-80727.
Uddin, S., Haque, I., Lu, H., Moni, M., & Gide, E. (2022). Comparative
performance analysis of the K-Nearest Neighbour (KNN) algorithm and its
various variants for disease prediction. Scientific Reports, 12.
Vargas, V. W. d., Aranda, J. A. S., Costa, R. d.S., Pereira, P. R. d.S., & Barbosa,
J. L. V.
(2022). Imbalanced data preprocessing techniques for machine
learning: a systematic mapping study. Knowledge and Information
Systems, 65, 31-57.
Wang, H., & Gao, Y. (2021). A study on parallelization of the Apriori
algorithm in association rule mining. Procedia Computer Science, 183,
641-647.
Wang, S., Celebi, M. E., Zhang, Y., Yu, X., Lu, S., Yao, X., Zhou, Q.,
Martinez-Garcia, M., Tian, Y., Górriz, J., & Tyukin, I. (2021). Advances in
data preprocessing for biomedical data fusion: an overview. Information
Fusion, 76, 376-421.
Wu, X., Zhu, X., Wu, G., & Ding, W. (2014). Data mining with big data. IEEE
Transactions on Knowledge and Data Engineering, 26, 97-107.
Xu, S. (2018). Bayesian naive Bayes classifier to text classification.Journal of
Information Science, 44, 48-59.
Yu, H., Wen, G., Gan, J., Zheng, W., & Lei, C. (2020). Self-paced Learning for
K-means Clustering Algorithm. Pattern Recognition Letters, 132, 69-75.
Zhang, S., Li, J., & Li, Y. (2021).Reachable distance functions for KNN
classification.IEEE Transactions on Knowledge and Data Engineering, 35,
7382-7396.
Zhang, S., Li, X., Zong, M., Zhu, X., & Wang, R. (2018).Efficient kNN
Classification With Different Numbers of Nearest Neighbors. IEEE
Transactions on Neural Networks and Learning Systems, 29, 1774-1785.
Zheng, Y., Chen, P., Chen, B., Wei, D., & Wang, M. (2021). Application of
Apriori Improvement Algorithm in Asthma Case Data Mining. Journal of
Healthcare Engineering, 2021.
Zhu, X., Ying, C., Wang, J., Li, J., Lai, X., & Wang, G. (2021). Ensemble of
ML-KNN for classification algorithm recommendation. Knowledge-Based Systems,
221, 106933.