Machine Learning: Linear Regression With a Single Variable
Linear regression with a single variable is a statistical method for modeling the relationship between two variables: one independent (the predictor) and one dependent (the target). The goal is to find the straight line that best fits the data points, so that the target can be predicted from the predictor. The line is usually written as y = mx + b, where m is the slope (how much y changes for a one-unit change in x) and b is the y-intercept (the value of y when x = 0). The fitted line minimizes the sum of squared differences between the actual values and the predicted values, which lets us predict the target for new values of the predictor. In the Canada per capita income example, the predictor is the year and the target is the income in US dollars; scikit-learn reports m as coef_ and b as intercept_.
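To make the idea of a "best fit" concrete, here is a minimal sketch of how the slope and intercept can be computed by hand with the ordinary least squares formulas. The helper name fit_line and the five data points are made up for illustration; they are not part of the income dataset or of scikit-learn.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The least squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up toy data (assumption for illustration only)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, b = fit_line(xs, ys)
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
# prints: slope m = 0.60, intercept b = 2.20
```

For a single predictor, a library such as scikit-learn should arrive at the same line; the hand-written version is only meant to show where m and b come from.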
#!/usr/bin/env python
# coding: utf-8

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

# Load the data from a CSV file.
# The dataset contains per capita income data for Canada over the years.
df = pd.read_csv("canada_per_capita_income.csv")

# Display the first 3 rows of the data to understand its structure
# Output:
#    year  per capita income (US$)
# 0  1970                     3399
# 1  1971                     3768
# 2  1972                     4251
print(df.head(3))

# Check the column names of the dataset
# Output:
# Index(['year', 'per capita income (US$)'], dtype='object')
print(df.columns)

# Rename the column 'per capita income (US$)' to 'per_capita_income_usd' for easier reference
newdf = df.rename(columns={'per capita income (US$)': 'per_capita_income_usd'})

# Confirm that the column was renamed successfully
# Output:
# Index(['year', 'per_capita_income_usd'], dtype='object')
print(newdf.columns)

# Create a scatter plot to visualize the data
# X-axis: Year
# Y-axis: Per Capita Income USD
# Data points: red stars
# (In a Jupyter notebook, run "%matplotlib inline" first to show plots inline.)
# Note: xlabel and ylabel are functions, so they must be called, not assigned to.
plt.xlabel("Year")
plt.ylabel("Per Capita Income USD")
plt.scatter(newdf.year, newdf.per_capita_income_usd, color='red', marker='*')
plt.show()

# Create a linear regression model
reg = linear_model.LinearRegression()

# Define the feature (X) and target (y) for the model
# X: the years (as a 2D structure, hence the double brackets)
# y: the per capita income in USD
X = newdf[['year']]
y = newdf['per_capita_income_usd']

# Train the linear regression model using the data
reg.fit(X, y)

# Predict the per capita income for the year 2030
input_df = pd.DataFrame({'year': [2030]})
predicted_income = reg.predict(input_df)

# Display the prediction
# Output: [61506.3306846]
print(predicted_income)

# The model finds a linear relationship of the form y = mx + b
# Here, we read the coefficient (m) and intercept (b) of this equation
coef = reg.coef_
intercept = reg.intercept_

# Display the coefficient and intercept
# Output:
# coef= [828.46507522]
# intercept= -1632210.7578554575
print("coef=", coef)
print("intercept=", intercept)

# Plot the original data points along with the regression line
# The regression line shows the predicted values based on the model
plt.xlabel("Year")
plt.ylabel("Per Capita Income USD")
plt.scatter(newdf.year, newdf.per_capita_income_usd, color='red', marker='+')
plt.plot(newdf.year, reg.predict(newdf[['year']]), color='blue')
plt.show()

# Predict per capita income for every 5 years from 2018 to 2093
years5 = list(range(2018, 2094, 5))

# Convert the list of years to a DataFrame
input_year_df = pd.DataFrame(years5, columns=['year'])

# Predict the per capita income for these years
input_year_df['per_capita_usd'] = reg.predict(input_year_df[['year']])

# Display the DataFrame with the predicted values
# Output:
#     year  per_capita_usd
# 0   2018    47927.137157
# 1   2023    52069.462533
# 2   2028    56211.787909
# 3   2033    60354.113284
# 4   2038    64496.438660
# 5   2043    68638.764036
# 6   2048    72781.089411
# 7   2053    76923.414787
# 8   2058    81065.740162
# 9   2063    85208.065538
# 10  2068    89350.390914
# 11  2073    93492.716289
# 12  2078    97635.041665
# 13  2083   101777.367041
# 14  2088   105919.692416
# 15  2093   110062.017792
print(input_year_df)

# Save the predictions to a CSV file
# This file will contain the predicted per capita income for every 5 years
input_year_df.to_csv("per_capita_prediction.csv", index=False)
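For a single-variable model, a call such as reg.predict should amount to nothing more than evaluating y = mx + b at each input. The sketch below shows that calculation in plain Python. The function name predict_line and the slope and intercept values are hypothetical, chosen only to keep the arithmetic easy to follow; they are not the values fitted on the income dataset.

```python
def predict_line(xs, slope, intercept):
    """Evaluate y = slope * x + intercept for every x in xs."""
    return [slope * x + intercept for x in xs]


# Hypothetical parameters (assumption for illustration, not the fitted model)
slope = 2.0
intercept = -3900.0

# Every 5 years from 2018 to 2033
years = list(range(2018, 2034, 5))
predictions = predict_line(years, slope, intercept)

for year, value in zip(years, predictions):
    print(year, value)
# prints:
# 2018 136.0
# 2023 146.0
# 2028 156.0
# 2033 166.0
```

With a fitted scikit-learn model, the same check could be done by plugging reg.coef_[0] and reg.intercept_ into predict_line and comparing the result with reg.predict.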