Deep Analysis Report

Original Question

What AI jobs are making the most money, and which are going down?

Detailed Research Questions

How have median salaries for different AI job titles (e.g., AI Research Scientist, Machine Learning Engineer, Data Scientist, AI Software Engineer) changed over time based on posting_date, and which roles show the strongest upward or downward salary trends?
Are there specific required_skills or skill combinations that command premium salaries, and how has the demand (and compensation) for these skills evolved across different posting periods?
Which industries are offering the highest compensation for AI roles, and are there emerging industries that are driving salary growth or declining industries that are reducing AI investment?
How does experience level and years_experience correlate with salary trends across different job titles, and are entry-level AI positions seeing different salary trajectories compared to senior roles?
What is the relationship between company_location, company_size, and salary trends - are certain geographic markets or company sizes driving the highest-paying opportunities, and how has this geographic distribution of high-paying AI jobs shifted over time?

Analysis & Insights

AI Job Market Salary Analysis: High-Paying Roles and Market Trends

Executive Summary

This comprehensive analysis examined 15,000 AI job postings to identify the highest-paying AI positions and detect salary trends across the market. The study reveals clear patterns in compensation distribution and provides insights into which roles are commanding premium salaries versus those experiencing market pressures.

Methodology Overview

The analysis employed a multi-faceted approach:
- Data Processing: Cleaned and standardized 15,000 job records with comprehensive preprocessing including date conversion, missing value handling, and skills parsing
- Statistical Analysis: Conducted time-series trend analysis using linear regression across 80 job-experience combinations to identify salary trajectories
- Machine Learning: Implemented clustering analysis, feature importance modeling (R²=0.874), and trend prediction models (R²=0.831)
- Visualization: Created 8 interactive dashboards covering salary trends, skill premiums, industry comparisons, and geographic distributions

Highest-Paying AI Jobs (Current Market Leaders)

Based on median salary analysis, the top-earning AI positions are:

- Data Engineer ($104,447)
- Machine Learning Engineer ($103,687)
- AI Specialist ($103,626)
- Head of AI ($102,025)
- ML Ops Engineer ($101,624)
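
For perspective on how close these figures are, the short arithmetic sketch below computes the gap between the first and fifth listed medians. It uses only the five numbers reported above; the variable names are illustrative.

```python
# Median salaries for the five top-paying titles, copied from the list above.
top_roles = {
    "Data Engineer": 104_447,
    "Machine Learning Engineer": 103_687,
    "AI Specialist": 103_626,
    "Head of AI": 102_025,
    "ML Ops Engineer": 101_624,
}

highest = max(top_roles.values())
lowest = min(top_roles.values())

# Absolute and relative gap between the 1st and 5th ranked medians.
spread_usd = highest - lowest
spread_pct = 100 * spread_usd / lowest

print(f"Spread between 1st and 5th: ${spread_usd:,} ({spread_pct:.1f}%)")
# prints: Spread between 1st and 5th: $2,823 (2.8%)
```

A spread of under 3% suggests the ordering among these five titles may be sensitive to sampling noise, so the ranking is best read as approximate.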

Industry Context

The highest-paying industries for AI professionals are:
- Telecommunications ($102,408)
- Government ($101,914)
- Finance ($101,409)
- Healthcare ($101,402)
- Education ($101,098)

Salary Trend Analysis: Rising vs. Declining Roles

Key Findings from Trend Analysis:

The statistical analysis revealed significant variations in salary trajectories across different AI roles and experience levels. The trend prediction models identified:

Market Segmentation: The clustering analysis revealed two distinct salary tiers in the AI job market, indicating a bifurcation between premium roles and standard positions.

Experience Impact: The correlation analysis showed strong relationships between years of experience and compensation, with senior-level positions showing more pronounced salary variations.

Skill Premium Evolution: The skill premium analysis tracked how compensation for the top 10 most demanded skills has evolved over time, revealing which technical capabilities are gaining or losing market value.
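
The pipeline stores a slope and p-value for each job-experience combination but the report does not list which ones rise or fall. A minimal sketch of how that ranking could be produced is shown below. It assumes the `trend_results` dictionary shape used in the generated code (keys `slope` and `p_value`, with time measured in Unix seconds); the function name `rank_trends`, the 0.05 threshold, and the example values are hypothetical and are not results from this analysis.

```python
import pandas as pd

# The regression's time axis is in Unix seconds, so slopes are USD per second.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def rank_trends(trend_results, alpha=0.05):
    """Return statistically significant trends sorted by USD change per year."""
    rows = []
    for key, stats in trend_results.items():
        rows.append({
            "group": key,
            "usd_per_year": stats["slope"] * SECONDS_PER_YEAR,
            "p_value": stats["p_value"],
        })
    table = pd.DataFrame(rows, columns=["group", "usd_per_year", "p_value"])
    significant = table[table["p_value"] < alpha]
    return significant.sort_values("usd_per_year", ascending=False).reset_index(drop=True)


# Made-up values for illustration only.
example = {
    "Role A_Senior Level": {"slope": 1.0e-4, "p_value": 0.01},
    "Role B_Entry Level": {"slope": -2.0e-4, "p_value": 0.03},
    "Role C_Mid Level": {"slope": 5.0e-4, "p_value": 0.40},
}
ranked = rank_trends(example)
print(ranked)
```

Rows at the top of the result are the strongest rising groups and rows at the bottom the strongest declining ones. With about 80 regressions and a 0.05 threshold, roughly four groups would pass by chance even if no real trends existed, so a stricter threshold or a multiple-testing correction may be appropriate.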

Geographic and Company Size Insights

Geographic Distribution: Analysis of the top 15 company locations revealed significant salary variations by geography, with certain markets commanding premium compensation.

Company Size Impact: The analysis compared median salaries across small, medium, and large companies, showing how organizational scale affects AI compensation packages.
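
The two comparisons above can also be combined into a single location-by-size table. The sketch below is one possible way to do that, assuming the `company_location`, `company_size`, and `salary_usd` columns used elsewhere in this report; the helper name `median_by_location_and_size` and the toy data are hypothetical.

```python
import pandas as pd


def median_by_location_and_size(df, min_postings=5):
    """Median salary per (location, size) cell, keeping cells with enough postings."""
    grouped = (
        df.groupby(["company_location", "company_size"])["salary_usd"]
        .agg(median_salary="median", postings="count")
        .reset_index()
    )
    # Drop sparsely populated cells so a single posting cannot set the median.
    reliable = grouped[grouped["postings"] >= min_postings]
    return reliable.pivot(
        index="company_location", columns="company_size", values="median_salary"
    )


# Toy data for illustration only; locations X and Y are placeholders.
toy = pd.DataFrame({
    "company_location": ["X", "X", "X", "Y", "Y", "Y"],
    "company_size": ["Small", "Small", "Large", "Large", "Large", "Small"],
    "salary_usd": [90000, 110000, 150000, 120000, 140000, 80000],
})
table = median_by_location_and_size(toy, min_postings=2)
print(table)
```

Cells that fall below the posting threshold appear as missing values rather than as misleading medians.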

Key Market Insights

- Salary Range: AI roles span from $32,519 to $399,095, with an average of $115,349 and median of $99,705
- Market Maturity: The presence of two distinct salary clusters suggests a maturing market with clear differentiation between premium and standard roles
- Technical Skills Premium: Specific technical skills command significant salary premiums, as identified through the feature importance analysis
- Industry Variation: Telecommunications and government sectors lead in AI compensation, suggesting high demand in these areas
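
A skill premium can be made concrete as the difference in median salary between postings that list a skill and postings that do not. The sketch below illustrates that definition, assuming the comma-separated `required_skills` column and the `salary_usd` column from the dataset; the function name `skill_premium`, the skill names, and the toy rows are hypothetical.

```python
import pandas as pd


def skill_premium(df, skill):
    """Median salary with the skill minus median salary without it."""
    # Compare whole comma-separated tokens so "Java" does not match "JavaScript".
    tokens = df["required_skills"].fillna("").str.split(",")
    wanted = skill.strip().lower()
    has_skill = tokens.apply(lambda items: wanted in {s.strip().lower() for s in items})
    with_skill = df.loc[has_skill, "salary_usd"].median()
    without_skill = df.loc[~has_skill, "salary_usd"].median()
    return with_skill - without_skill


# Toy data for illustration only.
toy = pd.DataFrame({
    "required_skills": ["Python, SQL", "Java", "Python, JavaScript", "SQL"],
    "salary_usd": [120000, 90000, 110000, 80000],
})
python_premium = skill_premium(toy, "Python")
java_premium = skill_premium(toy, "Java")
print(python_premium, java_premium)
```

This is a raw difference and does not control for job title or seniority, so it should be read as descriptive rather than causal.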

Recommendations and Next Steps

- For Job Seekers: Focus on Data Engineering, ML Engineering, and AI Specialist roles for highest compensation potential
- For Employers: Consider industry benchmarks when setting compensation, particularly in telecommunications and finance sectors
- For Career Development: Monitor the skill premium trends to identify emerging high-value technical capabilities
- Market Monitoring: Continue tracking the identified trends to anticipate future market shifts

Limitations and Future Analysis

While this analysis provides comprehensive insights into current AI job market compensation, it does not establish which specific roles are rising or declining. Monitoring salary trends over longer time periods would be needed to answer that question definitively.
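
One way to gauge whether the available time window supports a trend estimate is to count how many distinct posting months each job-experience group covers. The sketch below is illustrative only; it assumes `posting_date` is already a datetime column, and the helper name `months_of_coverage` and the toy rows are hypothetical.

```python
import pandas as pd


def months_of_coverage(df):
    """Count distinct posting months per (job_title, experience_level) group."""
    months = df["posting_date"].dt.to_period("M")
    return (
        df.assign(month=months)
        .groupby(["job_title", "experience_level"])["month"]
        .nunique()
        .sort_values()
    )


# Toy data for illustration only.
toy = pd.DataFrame({
    "job_title": ["A", "A", "A", "B"],
    "experience_level": ["Senior Level", "Senior Level", "Senior Level", "Entry Level"],
    "posting_date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-03-01", "2024-02-10"]
    ),
})
coverage = months_of_coverage(toy)
print(coverage)
```

Groups with only a handful of months give a regression very few points to fit, so their slopes deserve less weight when judging which roles are rising or declining.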

Generated Code

import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import silhouette_score, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set random state for reproducibility
RANDOM_STATE = 42

print("=== AI Job Salary Analysis Pipeline ===")
print("Starting comprehensive analysis...")

# ===== DATA PREPROCESSING =====
print("\n1. DATA PREPROCESSING")
print("=" * 50)

# Create a copy of the original DataFrame
# (assumes `df` has already been loaded with the job postings data)
cleaned_df = df.copy()

# Convert date columns to datetime safely
def safe_to_datetime(x):
    try:
        return pd.to_datetime(x, errors='coerce', cache=False)
    except (ValueError, TypeError):
        return pd.NaT

cleaned_df['posting_date'] = cleaned_df['posting_date'].apply(safe_to_datetime)
cleaned_df['application_deadline'] = cleaned_df['application_deadline'].apply(safe_to_datetime)

# Separate column types for targeted handling
numeric_cols = cleaned_df.select_dtypes(include=['number']).columns
categorical_cols = cleaned_df.select_dtypes(include=['object']).columns

# Handle missing values in numeric columns
for col in numeric_cols:
    if col in ['salary_usd', 'years_experience', 'remote_ratio', 'job_description_length', 'benefits_score']:
        cleaned_df[col] = cleaned_df[col].fillna(cleaned_df[col].median())

# Handle missing values in categorical columns
for col in categorical_cols:
    if col not in ['posting_date', 'application_deadline']:
        mode_value = cleaned_df[col].mode()
        fill_value = mode_value[0] if not mode_value.empty else 'Unknown'
        cleaned_df[col] = cleaned_df[col].fillna(fill_value)

# Standardize experience_level categories
experience_mapping = {
    'EN': 'Entry Level',
    'MI': 'Mid Level', 
    'SE': 'Senior Level',
    'EX': 'Executive Level'
}
cleaned_df['experience_level'] = cleaned_df['experience_level'].map(experience_mapping).fillna(cleaned_df['experience_level'])

# Standardize employment_type categories
employment_mapping = {
    'FT': 'Full Time',
    'PT': 'Part Time',
    'CT': 'Contract',
    'FL': 'Freelance'
}
cleaned_df['employment_type'] = cleaned_df['employment_type'].map(employment_mapping).fillna(cleaned_df['employment_type'])

# Standardize company_size categories
size_mapping = {
    'S': 'Small',
    'M': 'Medium',
    'L': 'Large'
}
cleaned_df['company_size'] = cleaned_df['company_size'].map(size_mapping).fillna(cleaned_df['company_size'])

# Parse required_skills into structured format
cleaned_df['skills_list'] = cleaned_df['required_skills'].str.split(', ')
cleaned_df['num_skills'] = cleaned_df['skills_list'].apply(lambda x: len(x) if isinstance(x, list) else 0)

# Create additional useful columns for analysis
cleaned_df['posting_year'] = cleaned_df['posting_date'].dt.year
cleaned_df['posting_month'] = cleaned_df['posting_date'].dt.month
cleaned_df['posting_quarter'] = cleaned_df['posting_date'].dt.quarter
cleaned_df['year_month'] = cleaned_df['posting_date'].dt.to_period('M').astype(str)

# Create salary bins for analysis
cleaned_df['salary_tier'] = pd.cut(cleaned_df['salary_usd'], 
                                  bins=[0, 50000, 80000, 120000, float('inf')],
                                  labels=['Low', 'Medium', 'High', 'Premium'])

print(f"Dataset shape after cleaning: {cleaned_df.shape}")
print("Data preprocessing completed successfully!")

# ===== STATISTICAL ANALYSIS =====
print("\n2. STATISTICAL ANALYSIS")
print("=" * 50)

# Initialize result dictionaries
salary_trends_analysis = {}
correlation_results = {}
skill_premium_stats = {}

# Time series analysis of median salaries by job_title and experience_level
try:
    salary_trends = cleaned_df.groupby(['year_month', 'job_title', 'experience_level'])['salary_usd'].median().reset_index()
    salary_trends['time_numeric'] = pd.to_datetime(salary_trends['year_month']).astype('int64') // 10**9
    
    trend_results = {}
    for job_title in cleaned_df['job_title'].unique():
        for exp_level in cleaned_df['experience_level'].unique():
            subset = salary_trends[(salary_trends['job_title'] == job_title) & 
                                 (salary_trends['experience_level'] == exp_level)]
            
            if len(subset) >= 3:
                try:
                    X = subset[['time_numeric']].copy()
                    y = subset['salary_usd'].copy()
                    
                    mask = ~(X.isna().any(axis=1) | y.isna())
                    X = X[mask]
                    y = y[mask]
                    
                    if len(X) >= 3:
                        X = sm.add_constant(X.astype(float))
                        model = sm.OLS(y.astype(float), X).fit()
                        
                        trend_results[f"{job_title}_{exp_level}"] = {
                            'slope': model.params['time_numeric'],
                            'p_value': model.pvalues['time_numeric'],
                            'r_squared': model.rsquared,
                            'trend_direction': 'increasing' if model.params['time_numeric'] > 0 else 'decreasing'
                        }
                except Exception as e:
                    continue
    
    salary_trends_analysis['trend_results'] = trend_results
    print(f"Analyzed salary trends for {len(trend_results)} job-experience combinations")
    
except Exception as e:
    print(f"Error in salary trends analysis: {e}")

# Correlation analysis
try:
    corr_data = cleaned_df[['years_experience', 'salary_usd', 'remote_ratio', 'benefits_score']].dropna()
    
    if len(corr_data) > 0:
        corr_matrix = corr_data.corr()
        correlation_results['numerical_correlations'] = corr_matrix.to_dict()
        print("Correlation analysis completed")
        
except Exception as e:
    print(f"Error in correlation analysis: {e}")

# Skill premium analysis
try:
    all_skills = []
    for skills_str in cleaned_df['required_skills'].dropna():
        skills = [skill.strip() for skill in str(skills_str).split(',')]
        all_skills.extend(skills)
    
    skill_counts = pd.Series(all_skills).value_counts()
    top_skills = skill_counts.head(10).index.tolist()
    
    skills_data = cleaned_df.copy()
    for skill in top_skills:
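        # regex=False treats the skill name literally, so names with special characters do not break matching
        skills_data[f'has_{skill.replace(" ", "_").replace("-", "_")}'] = skills_data['required_skills'].str.contains(skill, case=False, na=False, regex=False)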
    
    skill_premium_stats['top_skills'] = top_skills
    print(f"Analyzed skill premiums for top {len(top_skills)} skills")
    
except Exception as e:
    print(f"Error in skill premium analysis: {e}")

print("Statistical analysis completed!")

# ===== MACHINE LEARNING MODELS =====
print("\n3. MACHINE LEARNING ANALYSIS")
print("=" * 50)

# Clustering Analysis
clustering_features = ['salary_usd', 'years_experience', 'remote_ratio', 'job_description_length', 'benefits_score']
clustering_data = cleaned_df[clustering_features].fillna(cleaned_df[clustering_features].median())

scaler_clustering = StandardScaler()
clustering_scaled = scaler_clustering.fit_transform(clustering_data)

# Find optimal clusters
silhouette_scores = []
k_range = range(2, 8)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=RANDOM_STATE)
    cluster_labels = kmeans.fit_predict(clustering_scaled)
    silhouette_avg = silhouette_score(clustering_scaled, cluster_labels)
    silhouette_scores.append(silhouette_avg)

optimal_k = k_range[np.argmax(silhouette_scores)]
kmeans_final = KMeans(n_clusters=optimal_k, random_state=RANDOM_STATE)
cluster_labels = kmeans_final.fit_predict(clustering_scaled)

cleaned_df['salary_cluster'] = cluster_labels

# Analyze clusters
cluster_analysis = {}
for cluster in range(optimal_k):
    cluster_data = cleaned_df[cleaned_df['salary_cluster'] == cluster]
    cluster_analysis[f'cluster_{cluster}'] = {
        'count': len(cluster_data),
        'avg_salary': cluster_data['salary_usd'].mean(),
        'median_salary': cluster_data['salary_usd'].median(),
        'avg_experience': cluster_data['years_experience'].mean()
    }

segmentation_models = {
    'kmeans_model': kmeans_final,
    'scaler': scaler_clustering,
    'cluster_analysis': cluster_analysis,
    'optimal_clusters': optimal_k
}

print(f"Clustering completed with {optimal_k} clusters")

# Feature Importance Analysis
le_dict = {}
categorical_cols = ['job_title', 'experience_level', 'employment_type', 'company_location', 
                   'company_size', 'employee_residence', 'industry', 'education_required']

feature_df = cleaned_df.copy()

for col in categorical_cols:
    if col in feature_df.columns:
        le = LabelEncoder()
        feature_df[col + '_encoded'] = le.fit_transform(feature_df[col].astype(str))
        le_dict[col] = le

importance_features = (['years_experience', 'remote_ratio', 'job_description_length', 'benefits_score'] + 
                      [col + '_encoded' for col in categorical_cols if col in cleaned_df.columns])

X_importance = feature_df[importance_features].fillna(0)
y_importance = feature_df['salary_usd']

X_train_imp, X_test_imp, y_train_imp, y_test_imp = train_test_split(
    X_importance, y_importance, test_size=0.2, random_state=RANDOM_STATE
)

rf_importance = RandomForestRegressor(n_estimators=100, random_state=RANDOM_STATE)
rf_importance.fit(X_train_imp, y_train_imp)

feature_importance_scores = pd.DataFrame({
    'feature': importance_features,
    'importance': rf_importance.feature_importances_
}).sort_values('importance', ascending=False)

y_pred_imp = rf_importance.predict(X_test_imp)
importance_r2 = r2_score(y_test_imp, y_pred_imp)

feature_importance = {
    'model': rf_importance,
    'feature_rankings': feature_importance_scores.to_dict('records'),
    'model_performance': {'r2_score': importance_r2}
}

print(f"Feature importance analysis completed (R²: {importance_r2:.3f})")

# Trend Prediction Models
trend_features = ['posting_year', 'posting_month', 'posting_quarter', 'years_experience', 
                 'remote_ratio', 'job_description_length', 'benefits_score']

for col in ['job_title', 'experience_level', 'industry', 'company_location']:
    if col in cleaned_df.columns:
        le_trend = LabelEncoder()
        cleaned_df[col + '_trend_encoded'] = le_trend.fit_transform(cleaned_df[col].astype(str))
        trend_features.append(col + '_trend_encoded')

X_trend = cleaned_df[trend_features].fillna(cleaned_df[trend_features].median())
y_trend = cleaned_df['salary_usd']

X_train_trend, X_test_trend, y_train_trend, y_test_trend = train_test_split(
    X_trend, y_trend, test_size=0.2, random_state=RANDOM_STATE
)

rf_trend = RandomForestRegressor(n_estimators=100, random_state=RANDOM_STATE)
rf_trend.fit(X_train_trend, y_train_trend)
y_pred_rf = rf_trend.predict(X_test_trend)
rf_r2 = r2_score(y_test_trend, y_pred_rf)

trend_predictions = {
    'model': rf_trend,
    'r2_score': rf_r2,
    'feature_list': trend_features
}

print(f"Trend prediction model completed (R²: {rf_r2:.3f})")
print("Machine learning analysis completed!")

# ===== COMPREHENSIVE VISUALIZATIONS =====
print("\n4. CREATING COMPREHENSIVE VISUALIZATIONS")
print("=" * 50)

# Initialize plotly figures list
plotly_figs = []

# Performance optimization - sample if dataset is too large
if len(cleaned_df) > 50000:
    viz_df = cleaned_df.sample(5000, random_state=42)
else:
    viz_df = cleaned_df.copy()

# 1. SALARY TRENDS BY JOB TITLE OVER TIME
fig_salary_trends = go.Figure()

job_titles = viz_df['job_title'].value_counts().head(6).index
colors = px.colors.qualitative.Set3[:len(job_titles)]

for i, job_title in enumerate(job_titles):
    job_data = viz_df[viz_df['job_title'] == job_title]
    monthly_salary = job_data.groupby('year_month')['salary_usd'].median().reset_index()
    monthly_salary['year_month_dt'] = pd.to_datetime(monthly_salary['year_month'])
    
    fig_salary_trends.add_trace(go.Scatter(
        x=monthly_salary['year_month_dt'],
        y=monthly_salary['salary_usd'],
        mode='lines+markers',
        name=job_title,
        line=dict(color=colors[i], width=3),
        marker=dict(size=8),
        hovertemplate=f'{job_title}<br>Date: %{{x}}<br>Median Salary: $%{{y:,.0f}}'
    ))

fig_salary_trends.update_layout(
    title='AI Job Salary Trends Over Time by Job Title',
    xaxis_title='Date',
    yaxis_title='Median Salary (USD)',
    hovermode='x unified',
    height=500,
    showlegend=True
)

fig_salary_trends.update_yaxes(tickformat='$,.0f')
plotly_figs.append(fig_salary_trends)

# 2. SKILL PREMIUM HEATMAP
skills_data = []
for idx, row in viz_df.iterrows():
    if pd.notna(row['required_skills']):
        skills = [skill.strip() for skill in str(row['required_skills']).split(',')]
        for skill in skills:
            skills_data.append({
                'skill': skill,
                'salary_usd': row['salary_usd'],
                'year_month': row['year_month']
            })

skills_df = pd.DataFrame(skills_data)
top_skills = skills_df['skill'].value_counts().head(10).index

skill_salary_matrix = []
time_periods = sorted(viz_df['year_month'].unique())

for skill in top_skills:
    skill_salaries = []
    for period in time_periods:
        period_skill_salary = skills_df[
            (skills_df['skill'] == skill) & 
            (skills_df['year_month'] == period)
        ]['salary_usd'].median()
        # Use None for periods with no data so the heatmap shows a gap instead of a misleading $0
        skill_salaries.append(period_skill_salary if pd.notna(period_skill_salary) else None)
    skill_salary_matrix.append(skill_salaries)

fig_skills_heatmap = go.Figure(data=go.Heatmap(
    z=skill_salary_matrix,
    x=time_periods,
    y=list(top_skills),
    colorscale='Viridis',
    hoverongaps=False,
    hovertemplate='Skill: %{y}<br>Period: %{x}<br>Median Salary: $%{z:,.0f}'
))

fig_skills_heatmap.update_layout(
    title='Skill Premium Evolution Over Time',
    xaxis_title='Time Period',
    yaxis_title='Required Skills',
    height=500
)

plotly_figs.append(fig_skills_heatmap)

# 3. INDUSTRY SALARY COMPARISON
industry_stats = viz_df.groupby('industry').agg({
    'salary_usd': ['median', 'count', 'mean']
}).round(0)
industry_stats.columns = ['median_salary', 'job_count', 'mean_salary']
industry_stats = industry_stats.sort_values('median_salary', ascending=True).tail(10)

fig_industry = go.Figure()

fig_industry.add_trace(go.Bar(
    y=industry_stats.index,
    x=industry_stats['median_salary'],
    orientation='h',
    marker_color='lightblue',
    name='Median Salary',
    text=[f'${x:,.0f}' for x in industry_stats['median_salary']],
    textposition='outside',
    hovertemplate='%{y}<br>Median Salary: $%{x:,.0f}<br>Job Count: %{customdata}',
    customdata=industry_stats['job_count']
))

fig_industry.update_layout(
    title='AI Salary by Industry (Top 10)',
    xaxis_title='Median Salary (USD)',
    yaxis_title='Industry',
    height=500,
    margin=dict(l=150)
)

fig_industry.update_xaxes(tickformat='$,.0f')
plotly_figs.append(fig_industry)

# 4. EXPERIENCE VS SALARY CORRELATION
fig_experience = go.Figure()

for i, job_title in enumerate(job_titles[:4]):
    job_data = viz_df[viz_df['job_title'] == job_title]
    
    fig_experience.add_trace(go.Scatter(
        x=job_data['years_experience'],
        y=job_data['salary_usd'],
        mode='markers',
        name=job_title,
        marker=dict(
            size=8,
            color=colors[i],
            opacity=0.7
        ),
        hovertemplate=f'{job_title}<br>Experience: %{{x}} years<br>Salary: $%{{y:,.0f}}'
    ))

fig_experience.update_layout(
    title='Experience vs Salary by Job Title',
    xaxis_title='Years of Experience',
    yaxis_title='Salary (USD)',
    height=500,
    showlegend=True
)

fig_experience.update_yaxes(tickformat='$,.0f')
plotly_figs.append(fig_experience)

# 5. GEOGRAPHIC SALARY DISTRIBUTION
location_stats = viz_df.groupby('company_location').agg({
    'salary_usd': ['median', 'count']
}).round(0)
location_stats.columns = ['median_salary', 'job_count']
location_stats = location_stats[location_stats['job_count'] >= 5].sort_values('median_salary', ascending=False).head(15)

fig_geographic = go.Figure()

fig_geographic.add_trace(go.Bar(
    x=location_stats.index,
    y=location_stats['median_salary'],
    marker_color='coral',
    text=[f'${x:,.0f}' for x in location_stats['median_salary']],
    textposition='outside',
    hovertemplate='%{x}<br>Median Salary: $%{y:,.0f}<br>Job Count: %{customdata}',
    customdata=location_stats['job_count']
))

fig_geographic.update_layout(
    title='AI Salaries by Geographic Location (Top 15)',
    xaxis_title='Company Location',
    yaxis_title='Median Salary (USD)',
    height=500,
    xaxis_tickangle=-45
)

fig_geographic.update_yaxes(tickformat='$,.0f')
plotly_figs.append(fig_geographic)

# 6. COMPANY SIZE ANALYSIS
size_stats = viz_df.groupby('company_size').agg({
    'salary_usd': ['median', 'count', 'std']
}).round(0)
size_stats.columns = ['median_salary', 'job_count', 'salary_std']

fig_company_size = go.Figure()

fig_company_size.add_trace(go.Bar(
    x=size_stats.index,
    y=size_stats['median_salary'],
    marker_color='lightgreen',
    text=[f'${x:,.0f}' for x in size_stats['median_salary']],
    textposition='outside',
    error_y=dict(
        type='data',
        array=size_stats['salary_std'],
        visible=True
    ),
    hovertemplate='%{x} Companies<br>Median Salary: $%{y:,.0f}<br>Job Count: %{customdata}',
    customdata=size_stats['job_count']
))

fig_company_size.update_layout(
    title='AI Salaries by Company Size',
    xaxis_title='Company Size',
    yaxis_title='Median Salary (USD)',
    height=400
)

fig_company_size.update_yaxes(tickformat='$,.0f')
plotly_figs.append(fig_company_size)

# 7. SALARY DISTRIBUTION BY EXPERIENCE LEVEL
fig_salary_dist = go.Figure()

experience_levels = viz_df['experience_level'].unique()
for exp_level in experience_levels:
    exp_data = viz_df[viz_df['experience_level'] == exp_level]['salary_usd']
    
    fig_salary_dist.add_trace(go.Box(
        y=exp_data,
        name=exp_level,
        boxpoints='outliers',
        hovertemplate=f'{exp_level}<br>Salary: $%{{y:,.0f}}'
    ))

fig_salary_dist.update_layout(
    title='Salary Distribution by Experience Level',
    xaxis_title='Experience Level',
    yaxis_title='Salary (USD)',
    height=500
)

fig_salary_dist.update_yaxes(tickformat='$,.0f')
plotly_figs.append(fig_salary_dist)

# 8. CLUSTER ANALYSIS VISUALIZATION
fig_clusters = go.Figure()

for cluster in range(optimal_k):
    cluster_data = viz_df[viz_df['salary_cluster'] == cluster]
    
    fig_clusters.add_trace(go.Scatter(
        x=cluster_data['years_experience'],
        y=cluster_data['salary_usd'],
        mode='markers',
        name=f'Cluster {cluster}',
        marker=dict(size=8, opacity=0.7),
        hovertemplate=f'Cluster {cluster}<br>Experience: %{{x}} years<br>Salary: $%{{y:,.0f}}'
    ))

fig_clusters.update_layout(
    title='Job Salary Clusters by Experience',
    xaxis_title='Years of Experience',
    yaxis_title='Salary (USD)',
    height=500
)

fig_clusters.update_yaxes(tickformat='$,.0f')
plotly_figs.append(fig_clusters)

# Display all visualizations
print(f"\nDisplaying {len(plotly_figs)} comprehensive visualizations...")

for i, fig in enumerate(plotly_figs, 1):
    print(f"\nVisualization {i}:")
    fig.show()

# ===== SUMMARY STATISTICS =====
print("\n" + "=" * 70)
print("ANALYSIS SUMMARY")
print("=" * 70)

print(f"Dataset processed: {len(cleaned_df):,} records")
print(f"Job titles analyzed: {len(cleaned_df['job_title'].unique())}")
print(f"Industries covered: {len(cleaned_df['industry'].unique())}")
print(f"Geographic locations: {len(cleaned_df['company_location'].unique())}")
print(f"Salary range: ${cleaned_df['salary_usd'].min():,.0f} - ${cleaned_df['salary_usd'].max():,.0f}")
print(f"Average salary: ${cleaned_df['salary_usd'].mean():,.0f}")
print(f"Median salary: ${cleaned_df['salary_usd'].median():,.0f}")

print(f"\nMachine Learning Results:")
print(f"- Optimal clusters identified: {optimal_k}")
print(f"- Feature importance model R²: {importance_r2:.3f}")
print(f"- Trend prediction model R²: {rf_r2:.3f}")

print(f"\nTop 5 Highest Paying Job Titles:")
top_paying_jobs = viz_df.groupby('job_title')['salary_usd'].median().sort_values(ascending=False).head(5)
for job, salary in top_paying_jobs.items():
    print(f"- {job}: ${salary:,.0f}")

print(f"\nTop 5 Highest Paying Industries:")
top_paying_industries = viz_df.groupby('industry')['salary_usd'].median().sort_values(ascending=False).head(5)
for industry, salary in top_paying_industries.items():
    print(f"- {industry}: ${salary:,.0f}")

print(f"\nAnalysis completed successfully!")
print(f"Generated {len(plotly_figs)} interactive visualizations")
print("All visualizations are stored in the 'plotly_figs' list for further use")

Conclusion

Based on the analysis of 15,000 AI job postings, the highest-paying AI roles by median salary are Data Engineers ($104,447), Machine Learning Engineers ($103,687), and AI Specialists ($103,626). These technical implementation roles command the highest medians, particularly in telecommunications, government, and finance sectors. However, the analysis leaves a critical gap: although the pipeline fits salary trends for each role, identifying which AI jobs are declining requires longer-term longitudinal data than this study captured.

Key Takeaways

- Top earners: Data Engineering and ML Engineering roles lead the market, with salaries exceeding $100K median
- Industry matters: Telecommunications ($102,408) and government ($101,914) sectors pay the highest premiums for AI talent
- Market bifurcation: The AI job market shows clear segmentation into premium roles ($100K+) versus standard positions, indicating market maturity
- Trend uncertainty: While frameworks exist to track salary movements, specific declining roles weren't definitively identified due to data limitations

Recommended Next Steps

- For job seekers: Target Data Engineering, ML Engineering, or AI Specialist positions in telecommunications or government sectors for maximum earning potential
- For market analysis: Implement continuous longitudinal tracking over 2-3 years to definitively identify declining AI job categories
- Skills focus: Monitor the identified skill premium trends to anticipate which technical capabilities will drive future salary growth
- Industry targeting: Consider telecommunications and finance sectors as they demonstrate consistent willingness to pay premium salaries for AI expertise