In [None]:
# -*- coding: utf-8 -*-
import os

#Read data and create pandas.DataFrame

import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from scipy.stats import norm
from scipy.optimize import curve_fit

from numpy.polynomial import Polynomial

# Open Files

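The next cells use the pandas library ('pd') to read data from CSV (Comma Separated Values) files. Each call has the form pd.read_csv(f1, sep=';', na_values=['   null']), and its parts are: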


1. pd: This is an alias for the pandas library, which is a popular Python library for data manipulation and analysis.

2. f1: This variable holds the path of the CSV file to read, as a string. A bare file name such as 'FSMT_NMDB_2015_1min.txt' is resolved relative to the current working directory.

3. sep=';': This parameter is used to specify the delimiter used in the CSV file to separate different values. In this case, the delimiter is set to a semicolon ';'. This is because some CSV files may use different delimiters (like commas, tabs, semicolons) to separate the values, and specifying the correct delimiter is important to parse the file correctly.

4. na_values=['   null']: This parameter lists the strings that should be treated as missing or NaN (Not a Number) values when reading the CSV file. Here the string '   null' (three leading spaces followed by null) marks a missing value. The spaces matter: the string must match the field exactly as it appears in the file, and whenever it is encountered pandas stores NaN instead.

5. pd.read_csv(): This is the pandas function used to read the CSV file and create a DataFrame (a two-dimensional tabular data structure) containing the data from the file.
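The effect of sep and na_values can be checked on a small in-memory sample. This is a minimal sketch: the text below is a made-up stand-in for one of the data files (the values are hypothetical, and the real files may contain more columns), and io.StringIO is used only so the example runs without a file on disk.

```python
import io

import pandas as pd

# Synthetic stand-in for one data file (hypothetical values, not real measurements).
sample_text = (
    "  start_date_time;   RCORR_E\n"
    "2015-01-01 00:00:00;   101.5\n"
    "2015-01-01 00:01:00;   null\n"
    "2015-01-01 00:02:00;   99.8\n"
)

# sep=';' splits fields on semicolons; the padded string '   null' is read as NaN.
sample = pd.read_csv(io.StringIO(sample_text), sep=';', na_values=['   null'])

print(sample.columns.tolist())  # column names keep their leading spaces
print(sample.isna().sum())      # number of missing values per column
```

Because the column names keep their padding, later cells must refer to them with the same leading spaces.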

In [None]:
f1 = 'FSMT_NMDB_2015_1min.txt'
f2 = 'FSMT_NMDB_2016_1min.txt'
f3 = 'FSMT_NMDB_2017_1min.txt'


dff1 = pd.read_csv(f1, sep = ';', na_values=['   null'])
dff2 = pd.read_csv(f2, sep = ';', na_values=['   null'])
dff3 = pd.read_csv(f3, sep = ';', na_values=['   null'])

In [None]:
dff1

# File Management

dff1["Date-Time"] = pd.to_datetime(dff1['  start_date_time'])

- This line of code adds a new column named "Date-Time" to the DataFrame dff1.
- The data for this new column is derived from the existing column named '  start_date_time' (note the two leading spaces in the column name; they must be typed exactly as they appear in the file header).
- The pd.to_datetime() function from pandas converts the values in that column from text to pandas' datetime format.
- This format allows for easy handling and manipulation of dates and times, for example selecting rows by date.

dff1["corr"] = dff1['   RCORR_E'].astype(np.float64)

- This line of code adds another new column named "corr" to the DataFrame dff1.
- The data for this new column is derived from the existing column named '   RCORR_E' (again, note the leading spaces in the column name, three in this case).
- The astype() method converts the values in the '   RCORR_E' column to the NumPy data type np.float64, which represents double-precision floating-point numbers.
- The conversion guarantees that every value in "corr" is a floating-point number, which is what the numerical steps below (comparison, fitting, plotting) expect. Missing values stay NaN, which is itself a floating-point value.
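The two conversions can be tried on a tiny frame before running them on the full files. This is a sketch under assumptions: the frame raw and its values are hypothetical and only mimic the padded column names used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame that mimics the raw columns (note the padded column names).
raw = pd.DataFrame({
    '  start_date_time': ['2015-01-01 00:00:00', '2015-01-01 00:01:00'],
    '   RCORR_E': ['101.5', '99.8'],
})

# Text timestamps become datetime64 values; text numbers become float64.
raw["Date-Time"] = pd.to_datetime(raw['  start_date_time'])
raw["corr"] = raw['   RCORR_E'].astype(np.float64)

print(raw.dtypes)
```

The printed dtypes show that the original columns are left as they were and the two new columns carry the converted types.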

In [None]:
dff1["Date-Time"] = pd.to_datetime(dff1['  start_date_time'] )
dff1["corr"] = dff1['   RCORR_E'].astype(np.float64)

In [None]:
dff1

In [None]:
dff2["Date-Time"] = pd.to_datetime(dff2['  start_date_time'] )
dff2["corr"] = dff2['   RCORR_E'].astype(np.float64)

In [None]:
dff2

In [None]:
dff3["Date-Time"] = pd.to_datetime(dff3['  start_date_time'] )
dff3["corr"] = dff3['   RCORR_E'].astype(np.float64)

In [None]:
dff3

data = pd.concat([dff1, dff2, dff3], ignore_index=True)

- The pd.concat() function from the pandas library concatenates (stacks row-wise) the three DataFrames dff1, dff2, and dff3 in the order given. The resulting DataFrame is stored in the variable data.
- ignore_index=True: With this setting the original row labels are discarded and the result gets a new continuous index 0, 1, 2, ... Without it, each yearly DataFrame would keep its own labels starting at 0, so the combined index would contain duplicates.
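The difference that ignore_index makes is easiest to see on toy data. The frames a, b, and c below are hypothetical stand-ins for the three yearly DataFrames.

```python
import pandas as pd

# Hypothetical stand-ins for the three yearly DataFrames.
a = pd.DataFrame({'corr': [1.0, 2.0]})
b = pd.DataFrame({'corr': [3.0]})
c = pd.DataFrame({'corr': [4.0, 5.0]})

kept = pd.concat([a, b, c])                      # original labels are kept
fresh = pd.concat([a, b, c], ignore_index=True)  # labels are renumbered

print(kept.index.tolist())   # labels repeat across the inputs
print(fresh.index.tolist())  # one continuous range
```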

In [None]:
data = pd.concat([dff1, dff2, dff3], ignore_index=True)

In [None]:
data

data.index = data['Date-Time']

- Assigns the values of the "Date-Time" column in the DataFrame data as the new index for the DataFrame
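One practical benefit of a datetime index is that rows can be selected with date strings. The sketch below uses a hypothetical frame named demo with made-up values; only the column names follow this notebook.

```python
import pandas as pd

# Hypothetical frame with the same column names as in this notebook.
demo = pd.DataFrame({
    'Date-Time': pd.to_datetime([
        '2015-12-31 23:59:00',
        '2016-01-01 00:00:00',
        '2016-01-01 00:01:00',
    ]),
    'corr': [100.0, 101.0, 102.0],
})
demo.index = demo['Date-Time']

# Partial-string indexing: select every row whose timestamp falls in 2016.
rows_2016 = demo.loc['2016']
print(rows_2016['corr'].tolist())
```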

In [None]:
data.index = data['Date-Time']

In [None]:
data

data.drop(columns=['  start_date_time'],inplace=True)

- The drop() method in pandas removes rows or columns from a DataFrame; here, columns=[...] names the column to remove from data.
- inplace=True applies the change directly to data instead of returning a new DataFrame.
- The same call is repeated for '   RCORR_E' and 'Date-Time'. Their values are already available as the "corr" column and as the index, so the duplicates are no longer needed.
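drop() also accepts several column names at once and, without inplace=True, returns a new DataFrame. A minimal sketch on a hypothetical frame (the columns a and b are placeholders):

```python
import pandas as pd

# Hypothetical frame; 'a' and 'b' are placeholder column names.
demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'corr': [5.0, 6.0]})

# Drop two columns in one call; the result is a new DataFrame.
trimmed = demo.drop(columns=['a', 'b'])

print(trimmed.columns.tolist())
print(demo.columns.tolist())  # demo itself is unchanged
```

Whether to use inplace=True or reassignment is largely a matter of style; the reassignment form makes it explicit which variable holds the result.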

In [None]:
data.drop(columns=['  start_date_time'],inplace=True)
data.drop(columns=['   RCORR_E'],inplace=True)
data.drop(columns=['Date-Time'],inplace=True)

In [None]:
data

# Converting bad counts (<= 0) to NaN

In [None]:
plt.plot(data['corr'], 'o')

data['corr'].where(data['corr'] > 0 ,inplace=True)

- The where() method in pandas keeps the values for which a condition is True (here data['corr'] > 0) and replaces all other values; the default replacement is NaN (Not a Number).
- In this case, every value in the "corr" column that is not greater than 0 (zero and negative counts) becomes NaN.
- With inplace=True, the Series is modified in place rather than returned.
- With inplace=False (the default), where() returns a new Series and leaves data unchanged until the result is assigned back: data['corr'] = data['corr'].where(data['corr'] > 0).
- The assignment form is the safer choice. Recent pandas versions warn when an inplace method is called on a selected column, and with Copy-on-Write enabled such a call leaves data unchanged.
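The masking step can be sketched on a few made-up values, using the assignment form so that the result does not depend on in-place behaviour. The frame demo and its values are hypothetical.

```python
import pandas as pd

# Hypothetical counts, including one negative and one zero value.
demo = pd.DataFrame({'corr': [100.0, -5.0, 0.0, 98.5]})

# Keep values greater than 0; everything else becomes NaN.
demo['corr'] = demo['corr'].where(demo['corr'] > 0)

print(demo['corr'].tolist())
```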

In [None]:
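# Assign the result back instead of calling where(..., inplace=True) on a selected
# column, which may leave `data` unchanged under pandas Copy-on-Write.
data['corr'] = data['corr'].where(data['corr'] > 0)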


In [None]:
plt.plot(data['corr'], 'o')

# Remove Outliers

Create data1, a copy of data in which the outliers will be marked as NaN

In [None]:
data1 = data.copy() # Create a new DataFrame named data1 that is a copy of the original DataFrame data.

In [None]:
data1

Create "data_f", a version of the data without NaN values

- data1.dropna(): The dropna() method is applied to the DataFrame data1. This method drops all rows that contain any NaN values.
- If a row contains at least one NaN value, it is removed entirely from the DataFrame.
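A short sketch of dropna() on made-up values shows that it returns a new DataFrame and leaves its input untouched, which is why data1 still contains its NaN rows afterwards. The frame demo is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value.
demo = pd.DataFrame(
    {'corr': [100.0, np.nan, 98.5]},
    index=pd.to_datetime(['2015-01-01 00:00', '2015-01-01 00:01', '2015-01-01 00:02']),
)

clean = demo.dropna()  # rows containing NaN are removed in the returned copy

print(len(demo), len(clean))
```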

In [None]:
data_f = data1.dropna()

In [None]:
data_f

Estimating Gaussian Distribution Parameters from data_f (data without 'NAN')

"mu_c, std_c = norm.fit(data_f['corr'])"

- The fit() function is used to estimate the parameters of the normal distribution that best fit the data.
- In this case, it will estimate the mean and standard deviation of the data in the "corr" column.
- mu_c, std_c = ...: The results of the fit() function are unpacked into two variables: mu_c and std_c. mu_c will store the estimated mean of the normal distribution, and std_c will store the estimated standard deviation.
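For a normal distribution, the maximum-likelihood estimates returned by norm.fit() are the sample mean and the population standard deviation (computed with ddof=0). The sketch below checks this on synthetic data; the random sample is an assumption made for illustration and is unrelated to the measured counts.

```python
import numpy as np
from scipy.stats import norm

# Synthetic sample (illustrative only).
rng = np.random.default_rng(0)
values = rng.normal(loc=100.0, scale=2.0, size=10_000)

mu, std = norm.fit(values)

# Compare the fitted parameters with the direct NumPy estimates.
print(np.isclose(mu, values.mean()))
print(np.isclose(std, values.std(ddof=0)))
```

This also explains why NaN values are dropped first: a single NaN in the input would make both estimates NaN.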

In [None]:
mu_c, std_c = norm.fit(data_f['corr'])

data1[np.abs(data1['corr'] - mu_c) > 4.5*std_c] = np.nan

- In this command, data1 is the copy of data that still contains the NaN values.
- The command identifies outliers in the "corr" column and replaces them with NaN.
- np.abs(data1['corr'] - mu_c) > 4.5 * std_c: This creates a Boolean mask that is True for values in the "corr" column that lie more than 4.5 standard deviations away from the mean mu_c. These are considered outliers under this criterion. Comparisons with NaN evaluate to False, so rows that are already NaN are not selected.
- data1[np.abs(data1['corr'] - mu_c) > 4.5 * std_c]: This selects the rows of data1 for which the mask is True.
- = np.nan: Finally, every column of the selected rows is set to NaN. If data1 contains only the "corr" column, as is the case when the input files hold just the two columns used above, this is the same as setting the outlier values in "corr" to NaN.
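When a DataFrame has more columns than "corr", the row-wise assignment would blank all of them. A variant that touches only one column uses .loc with the mask and a column name. This is a sketch under assumptions: the frame demo, the extra column other, and the values of mu and std are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an extra column that should stay intact.
demo = pd.DataFrame({
    'corr': [100.0, 101.0, 99.0, 500.0],
    'other': [1, 2, 3, 4],
})

mu, std = 100.0, 1.0  # hypothetical fit results

# True where the value lies more than 4.5 standard deviations from the mean.
mask = np.abs(demo['corr'] - mu) > 4.5 * std

# Set only the "corr" entries of the flagged rows to NaN.
demo.loc[mask, 'corr'] = np.nan

print(demo)
```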

In [None]:
data1[np.abs(data1['corr'] - mu_c) > 4.5*std_c] = np.nan

In [None]:
data1

plt.plot(data['corr'], 'o')

This line plots the "corr" column of the DataFrame data as individual markers. The values of "corr" go on the y-axis and the index of the DataFrame (data.index, the date and time) on the x-axis. The 'o' format string draws each value as a circle without connecting lines, so every point corresponds to one row of data.

plt.plot(data1['corr'], 'o')

This line does the same for the "corr" column of the DataFrame data1, again with the index (data1.index) on the x-axis and circles as markers. Values that are NaN in data1 are not drawn.


Plotting data['corr'] and data1['corr'] on the same figure makes the removed points visible: wherever the two series agree, the markers drawn second cover the markers drawn first, so only the values that were set to NaN in data1 remain visible in the first colour.

data['corr']: original data

data1['corr']: data in which the outliers are marked as NaN
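The comparison plot can be made easier to read with axis labels and a legend. The sketch below uses a small synthetic series; the values, the cut-off of 200, and the label texts are assumptions chosen for illustration.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Synthetic series with one obvious outlier (illustrative values only).
index = pd.date_range('2015-01-01', periods=5, freq='min')
original = pd.Series([100.0, 101.0, 500.0, 99.0, 100.5], index=index)
cleaned = original.where(original < 200)  # outlier becomes NaN

fig, ax = plt.subplots()
ax.plot(original, 'o', color='blue', label='original')
ax.plot(cleaned, 'o', color='red', label='outliers set to NaN')
ax.set_xlabel('Date-Time')
ax.set_ylabel('corr')
ax.legend()
```

Because the cleaned series is drawn second, the only blue marker left visible is the point that was removed.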

In [None]:
plt.plot(data['corr'], 'o', color='blue')
plt.plot(data1['corr'], 'o', color='red')

# Save Cleaned Data to a File

In [None]:
df1 = data1.copy()

In [None]:
df1

In [None]:
df1.to_csv(r'FSMT_cleaned_data.csv')