Pandas and DataFrames

In this lesson we will be exploring data analysis using Pandas.

  • College Board talks about ideas like
    • Tools. "the ability to process data depends on users' capabilities and their tools"
    • Combining Data. "combine county data sets"
    • Status on Data. "determining the artist with the greatest attendance during a particular month"
    • Data poses challenge. "the need to clean data", "incomplete data"
  • From Pandas Overview -- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

DataFrame

'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd

Cleaning Data

When looking at a data set, check to see what data needs to be cleaned. Examples include:

  • Missing Data Points
  • Invalid Data
  • Inaccurate Data

Run the following code to see what needs to be cleaned.

df = pd.read_json('files/grade.json')

print(df)
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data?  Hint, remember Garbage in, Garbage out?
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
2         578             12  2.78
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
6         145             12  2.95
7         167             10  3.90
8         235      9th Grade  3.15
9         nil              9  2.80
10        469             11  3.45
11        456             10  2.75
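One possible way to clean these rows is sketched here. It is only a sketch under assumptions: the rows are copied by hand from the output above instead of being read from files/grade.json, the YEAR_LABELS mapping is a hypothetical choice, and treating grades outside 9 through 12 as invalid is an assumption about this school.

```python
import pandas as pd

# A few rows copied from the output above, including the problem rows.
raw = pd.DataFrame({
    "Student ID": [123, 324, 313, 235, "nil", 469, 469],
    "Year in School": [12, "Junior", 20, "9th Grade", 9, 11, 11],
    "GPA": [3.57, 4.75, 3.33, 3.15, 2.80, 3.45, 3.45],
})

# Assumed mapping from text labels to grade numbers (not part of the lesson data).
YEAR_LABELS = {"Junior": 11, "9th Grade": 9}

clean = raw.copy()

# Non-numeric IDs such as "nil" become NaN so they can be dropped later.
clean["Student ID"] = pd.to_numeric(clean["Student ID"], errors="coerce")

# Translate text labels to numbers; anything else that is not a number becomes NaN.
clean["Year in School"] = pd.to_numeric(
    clean["Year in School"].map(lambda value: YEAR_LABELS.get(value, value)),
    errors="coerce",
)

clean = clean.dropna(subset=["Student ID"])             # missing IDs
clean = clean[clean["Year in School"].between(9, 12)]   # assumed valid grade range
clean = clean.drop_duplicates().astype({"Student ID": int})  # exact repeat rows

print(clean)
```

Whether a row like the "nil" ID should be dropped or repaired depends on the purpose of the data set, so treat each step as a decision to justify, not a rule.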

Extracting Info

Take a look at some features of the Pandas library that extract info from the dataset.

DataFrame Extract Column

print(df[['GPA']])

print()

#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False)) #turned into string 
     GPA
0   3.57
1   4.00
2   2.78
3   3.45
4   4.75
5   3.33
6   2.95
7   3.90
8   3.15
9   2.80
10  3.45
11  2.75

Student ID  GPA
       123 3.57
       246 4.00
       578 2.78
       469 3.45
       324 4.75
       313 3.33
       145 2.95
       167 3.90
       235 3.15
       nil 2.80
       469 3.45
       456 2.75
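The cell above uses double brackets. As a small sketch of why that matters, the code here compares single and double brackets on a three-row stand-in for the grade data (rows copied from the output above; the variable names are only for illustration).

```python
import pandas as pd

# Three rows copied from the grade output above.
grades = pd.DataFrame({
    "Student ID": [123, 246, 578],
    "GPA": [3.57, 4.00, 2.78],
})

as_series = grades["GPA"]      # single brackets: one column as a Series
as_frame = grades[["GPA"]]     # double brackets: a DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

A Series is one-dimensional, so its shape has one entry; the DataFrame keeps its column header and a two-entry shape.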

DataFrame Sort

print(df.sort_values(by=['GPA']))

print()

#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))
   Student ID Year in School   GPA
11        456             10  2.75
2         578             12  2.78
9         nil              9  2.80
6         145             12  2.95
8         235      9th Grade  3.15
5         313             20  3.33
3         469             11  3.45
10        469             11  3.45
0         123             12  3.57
7         167             10  3.90
1         246             10  4.00
4         324         Junior  4.75

   Student ID Year in School   GPA
4         324         Junior  4.75
1         246             10  4.00
7         167             10  3.90
0         123             12  3.57
3         469             11  3.45
10        469             11  3.45
5         313             20  3.33
8         235      9th Grade  3.15
6         145             12  2.95
9         nil              9  2.80
2         578             12  2.78
11        456             10  2.75
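Sorting can also use more than one column. A minimal sketch, assuming a five-row stand-in copied from the grade output: sort by year first, then by GPA from high to low within each year. The ignore_index option renumbers the rows after sorting.

```python
import pandas as pd

# Five rows copied from the grade output above.
grades = pd.DataFrame({
    "Student ID": [123, 246, 578, 469, 167],
    "Year in School": [12, 10, 12, 11, 10],
    "GPA": [3.57, 4.00, 2.78, 3.45, 3.90],
})

ranked = grades.sort_values(
    by=["Year in School", "GPA"],
    ascending=[True, False],   # year ascending, GPA descending
    ignore_index=True,         # renumber rows 0..n-1 after sorting
)

print(ranked)
```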

DataFrame Selection or Filter

print(df[df.GPA > 3.00])
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
7         167             10  3.90
8         235      9th Grade  3.15
10        469             11  3.45
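Filters can be combined. The sketch here assumes a small stand-in copied from the grade output and keeps only 12th graders with a GPA above 3.00. Each condition needs its own parentheses, and pandas uses & for "and" and | for "or".

```python
import pandas as pd

# Rows copied from the grade output above.
grades = pd.DataFrame({
    "Student ID": [123, 246, 578, 145],
    "Year in School": [12, 10, 12, 12],
    "GPA": [3.57, 4.00, 2.78, 2.95],
})

# Both conditions must be true for a row to be kept.
seniors_above_3 = grades[(grades["GPA"] > 3.00) & (grades["Year in School"] == 12)]

# Either condition may be true for a row to be kept.
seniors_or_above_3 = grades[(grades["GPA"] > 3.00) | (grades["Year in School"] == 12)]

print(seniors_above_3)
print(seniors_or_above_3)
```

The bracket form grades["GPA"] does the same job as grades.GPA and also works for column names with spaces, such as "Year in School".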

DataFrame Selection Max and Min

print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
  Student ID Year in School   GPA
4        324         Junior  4.75

   Student ID Year in School   GPA
11        456             10  2.75
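There is more than one way to find the top and bottom rows. As a hedged sketch on a four-row stand-in copied from the grade output, idxmax returns the label of the first maximum and nlargest returns the top n rows. Note one difference from the comparison used above: comparing against max() keeps every tied row, while idxmax picks only the first.

```python
import pandas as pd

# Rows copied from the grade output above.
grades = pd.DataFrame({
    "Student ID": [123, 246, 324, 456],
    "GPA": [3.57, 4.00, 4.75, 2.75],
})

top_row = grades.loc[grades["GPA"].idxmax()]   # a single row, returned as a Series
top_two = grades.nlargest(2, "GPA")            # the two highest GPAs
bottom_one = grades.nsmallest(1, "GPA")        # the lowest GPA

print(top_row)
print(top_two)
print(bottom_one)
```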

Create your own DataFrame

Using Pandas allows you to create your own DataFrame in Python.

Python Dictionary to Pandas DataFrame

import pandas as pd

#the data can be stored as a python dictionary
#(named "data" rather than "dict" so the built-in dict type is not shadowed)
data = {
  "calories": [420, 380, 390, 460],
  "duration": [50, 40, 45, 42]
}
#stores the data in a data frame
print("-------------Dict_to_DF------------------")
df = pd.DataFrame(data)
print(df)

print("----------Dict_to_DF_labels--------------")

#or with the index argument, you can label rows.
df = pd.DataFrame(data, index = ["day1", "day2", "day3", "day4"])
print(df)
-------------Dict_to_DF------------------
   calories  duration
0       420        50
1       380        40
2       390        45
3       460        42
----------Dict_to_DF_labels--------------
      calories  duration
day1       420        50
day2       380        40
day3       390        45
day4       460        42
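Data from a JSON file or an API often arrives as a list of records instead of a dictionary of columns. A minimal sketch, assuming the same calories and duration values as above reshaped by hand into records, with a hypothetical "day" key used as the row label:

```python
import pandas as pd

# The same values as above, written as one dictionary per row.
records = [
    {"day": "day1", "calories": 420, "duration": 50},
    {"day": "day2", "calories": 380, "duration": 40},
    {"day": "day3", "calories": 390, "duration": 45},
    {"day": "day4", "calories": 460, "duration": 42},
]

# Each dictionary becomes a row; set_index turns the "day" column into row labels.
df = pd.DataFrame(records).set_index("day")

print(df)
```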

Examine DataFrame Rows

print("-------Examine Selected Rows---------")
#use a list for multiple labels:
print(df.loc[["day1", "day3"]])

#refer to the row index:
print("--------Examine Single Row-----------")
print(df.loc["day1"])

#output int64: a 64 bit integer 
-------Examine Selected Rows---------
      calories  duration
day1       420        50
day3       390        45
--------Examine Single Row-----------
calories    420
duration     50
Name: day1, dtype: int64

Pandas DataFrame Information

df.info() #info() prints its report directly and returns None, so it is not wrapped in print()
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, day1 to day4
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   calories  4 non-null      int64
 1   duration  4 non-null      int64
dtypes: int64(2)
memory usage: 96.0+ bytes

Example of larger data set

Pandas can read CSV and many other types of files. Run the following code to see more features with a larger data set.

import pandas as pd

#read csv and sort 'Duration' largest to smallest
df = pd.read_csv('files/data.csv').sort_values(by=['Duration'], ascending=False)

print("--Duration Top 10---------")
print(df.head(10)) #head is beginning

print("--Duration Bottom 10------")
print(df.tail(10)) #tail is ending 
--Duration Top 10---------
     Duration  Pulse  Maxpulse  Calories
69        300    108       143    1500.2
79        270    100       131    1729.0
109       210    137       184    1860.4
60        210    108       160    1376.0
106       180     90       120     800.3
90        180    101       127     600.1
65        180     90       130     800.4
61        160    110       137    1034.4
62        160    109       135     853.0
67        150    107       130     816.0
--Duration Bottom 10------
     Duration  Pulse  Maxpulse  Calories
68         20    106       136     110.4
100        20     95       112      77.7
89         20     83       107      50.3
135        20    136       156     189.0
94         20    150       171     127.4
95         20    151       168     229.4
139        20    141       162     222.4
64         20    110       130     131.4
112        15    124       139     124.2
93         15     80       100      50.5
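To try the same idea without the data.csv file, the sketch here reads CSV text from memory. The four rows are copied from the output above; using io.StringIO in place of a file path, and nlargest/nsmallest in place of sorting plus head/tail, are choices made for this illustration.

```python
import io
import pandas as pd

# Four rows copied from the output above, written as CSV text.
csv_text = """Duration,Pulse,Maxpulse,Calories
300,108,143,1500.2
15,80,100,50.5
210,137,184,1860.4
20,106,136,110.4
"""

# read_csv accepts a file-like object as well as a file path.
df = pd.read_csv(io.StringIO(csv_text))

longest = df.nlargest(2, "Duration")     # two longest sessions
shortest = df.nsmallest(2, "Duration")   # two shortest sessions

print(longest)
print(shortest)
```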

APIs are a Source for Writing Programs with Data

3rd Party APIs are a great source for creating Pandas Data Frames.

  • Data can be fetched and resulting json can be placed into a Data Frame
  • Observe output, this looks very similar to a Database
'''Pandas can be used to analyze data'''
import pandas as pd
import requests

def fetch():
    '''Obtain data from an endpoint and show a slice of it'''
    url = "https://flask.nighthawkcodingsociety.com/api/covid/"
    response = requests.get(url)  # named response so the function name fetch is not reused
    data = response.json()        # named data so the name json is not reused

    # filter data for requirement
    df = pd.DataFrame(data['countries_stat'])  # filter endpoint for country stats
    print(df.loc[0:5, 'country_name':'deaths']) # show rows 0 through 5 and columns country_name through deaths

fetch()
  country_name       cases     deaths
0          USA  82,649,779  1,018,316
1        India  43,057,545    522,193
2       Brazil  30,345,654    662,663
3       France  28,244,977    145,020
4      Germany  24,109,433    134,624
5           UK  21,933,206    173,352
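The cases and deaths in this output contain commas, which suggests they are stored as text. Text cannot be sorted or averaged as numbers, so a cleaning step is likely needed first. A minimal sketch, assuming three rows copied by hand from the output above (no network call is made):

```python
import pandas as pd

# Three rows copied from the API output above; the numbers are text with commas.
stats = pd.DataFrame({
    "country_name": ["USA", "India", "Brazil"],
    "cases": ["82,649,779", "43,057,545", "30,345,654"],
    "deaths": ["1,018,316", "522,193", "662,663"],
})

# Remove the commas, then convert the text to integers.
for column in ["cases", "deaths"]:
    stats[column] = stats[column].str.replace(",", "", regex=False).astype(int)

# Numeric columns can now be sorted correctly.
by_deaths = stats.sort_values(by="deaths", ascending=False)

print(by_deaths)
```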

Hacks

Testing Pandas on a movie data set (tmdb-movies.csv):

import pandas as pd
movies = pd.read_csv('files/tmdb-movies.csv')
movies.head() #to preview the data set
       id  imdb_id  popularity  budget  revenue  original_title  cast  homepage  director  tagline  ...  overview  runtime  genres  production_companies  release_date  vote_count  vote_average  release_year  budget_adj  revenue_adj
0  135397  tt0369610  32.985763  150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  http://www.jurassicworld.com/  Colin Trevorrow  The park is open.  ...  Twenty-two years after the events of Jurassic ...  124  Action|Adventure|Science Fiction|Thriller  Universal Studios|Amblin Entertainment|Legenda...  6/9/15  5562  6.5  2015  1.379999e+08  1.392446e+09
1  76341  tt1392190  28.419936  150000000  378436354  Mad Max: Fury Road  Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...  http://www.madmaxmovie.com/  George Miller  What a Lovely Day.  ...  An apocalyptic story set in the furthest reach...  120  Action|Adventure|Science Fiction|Thriller  Village Roadshow Pictures|Kennedy Miller Produ...  5/13/15  6185  7.1  2015  1.379999e+08  3.481613e+08
2  262500  tt2908446  13.112507  110000000  295238201  Insurgent  Shailene Woodley|Theo James|Kate Winslet|Ansel...  http://www.thedivergentseries.movie/#insurgent  Robert Schwentke  One Choice Can Destroy You  ...  Beatrice Prior must confront her inner demons ...  119  Adventure|Science Fiction|Thriller  Summit Entertainment|Mandeville Films|Red Wago...  3/18/15  2480  6.3  2015  1.012000e+08  2.716190e+08
3  140607  tt2488496  11.173104  200000000  2068178225  Star Wars: The Force Awakens  Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...  http://www.starwars.com/films/star-wars-episod...  J.J. Abrams  Every generation has a story.  ...  Thirty years after defeating the Galactic Empi...  136  Action|Adventure|Science Fiction|Fantasy  Lucasfilm|Truenorth Productions|Bad Robot  12/15/15  5292  7.5  2015  1.839999e+08  1.902723e+09
4  168259  tt2820852  9.335014  190000000  1506249360  Furious 7  Vin Diesel|Paul Walker|Jason Statham|Michelle ...  http://www.furious7.com/  James Wan  Vengeance Hits Home  ...  Deckard Shaw seeks revenge against Dominic Tor...  137  Action|Crime|Thriller  Universal Pictures|Original Film|Media Rights ...  4/1/15  2947  7.3  2015  1.747999e+08  1.385749e+09

5 rows × 21 columns

movies.tail()
          id  imdb_id  popularity  budget  revenue  original_title  cast  homepage  director  tagline  ...  overview  runtime  genres  production_companies  release_date  vote_count  vote_average  release_year  budget_adj  revenue_adj
10861  21  tt0060371  0.080598  0  0  The Endless Summer  Michael Hynson|Robert August|Lord 'Tally Ho' B...  NaN  Bruce Brown  NaN  ...  The Endless Summer, by Bruce Brown, is one of ...  95  Documentary  Bruce Brown Films  6/15/66  11  7.4  1966  0.000000  0.0
10862  20379  tt0060472  0.065543  0  0  Grand Prix  James Garner|Eva Marie Saint|Yves Montand|Tosh...  NaN  John Frankenheimer  Cinerama sweeps YOU into a drama of speed and ...  ...  Grand Prix driver Pete Aron is fired by his te...  176  Action|Adventure|Drama  Cherokee Productions|Joel Productions|Douglas ...  12/21/66  20  5.7  1966  0.000000  0.0
10863  39768  tt0060161  0.065141  0  0  Beregis Avtomobilya  Innokentiy Smoktunovskiy|Oleg Efremov|Georgi Z...  NaN  Eldar Ryazanov  NaN  ...  An insurance agent who moonlights as a carthie...  94  Mystery|Comedy  Mosfilm  1/1/66  11  6.5  1966  0.000000  0.0
10864  21449  tt0061177  0.064317  0  0  What's Up, Tiger Lily?  Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh...  NaN  Woody Allen  WOODY ALLEN STRIKES BACK!  ...  In comic Woody Allen's film debut, he took the...  80  Action|Comedy  Benedict Pictures Corp.  11/2/66  22  5.4  1966  0.000000  0.0
10865  22293  tt0060666  0.035919  19000  0  Manos: The Hands of Fate  Harold P. Warren|Tom Neyman|John Reynolds|Dian...  NaN  Harold P. Warren  It's Shocking! It's Beyond Your Imagination!  ...  A family gets lost on the road and stumbles up...  74  Horror  Norm-Iris  11/15/66  15  1.5  1966  127642.279154  0.0

5 rows × 21 columns

movies.columns
Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
       'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
       'runtime', 'genres', 'production_companies', 'release_date',
       'vote_count', 'vote_average', 'release_year', 'budget_adj',
       'revenue_adj'],
      dtype='object')
print(movies.sort_values(by=['popularity']))
          id    imdb_id  popularity     budget     revenue  \
6181   18729  tt0088583    0.000065          0           0   
9977   32082  tt0067217    0.000188          0           0   
6080  174323  tt2429074    0.000620          0           0   
6551   31329  tt0445570    0.000973          0           0   
6961   15412  tt0466460    0.001115          0           0   
...      ...        ...         ...        ...         ...   
2     262500  tt2908446   13.112507  110000000   295238201   
630   118340  tt2015381   14.311205  170000000   773312399   
629   157336  tt0816692   24.949134  165000000   621752480   
1      76341  tt1392190   28.419936  150000000   378436354   
0     135397  tt0369610   32.985763  150000000  1513528810   

                  original_title  \
6181     North and South, Book I   
9977                The Hospital   
6080                      G.B.F.   
6551  Mon petit doigt m'a dit...   
6961           Khosla Ka Ghosla!   
...                          ...   
2                      Insurgent   
630      Guardians of the Galaxy   
629                 Interstellar   
1             Mad Max: Fury Road   
0                 Jurassic World   

                                                   cast  \
6181  Patrick Swayze|Philip Casnoff|Kirstie Alley|Ge...   
9977  George C. Scott|Diana Rigg|Richard Dysart|Barn...   
6080  Michael J. Willett|Paul Iacono|Sasha Pieterse|...   
6551  Catherine Frot|André Dussollier|Geneviève Bu...   
6961  Anupam Kher|Boman Irani|Parvin Dabas|Tara Shar...   
...                                                 ...   
2     Shailene Woodley|Theo James|Kate Winslet|Ansel...   
630   Chris Pratt|Zoe Saldana|Dave Bautista|Vin Dies...   
629   Matthew McConaughey|Jessica Chastain|Anne Hath...   
1     Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...   
0     Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...   

                                            homepage           director  \
6181                                             NaN                NaN   
9977                                             NaN      Arthur Hiller   
6080               https://www.facebook.com/GBFmovie       Darren Stein   
6551                                             NaN      Pascal Thomas   
6961                                             NaN   Dibakar Banerjee   
...                                              ...                ...   
2     http://www.thedivergentseries.movie/#insurgent   Robert Schwentke   
630                      http://marvel.com/guardians         James Gunn   
629                http://www.interstellarmovie.net/  Christopher Nolan   
1                        http://www.madmaxmovie.com/      George Miller   
0                      http://www.jurassicworld.com/    Colin Trevorrow   

                                                tagline  ...  \
6181                                                NaN  ...   
9977                   Madness, Murder and Malpractice.  ...   
6080                    Being out has never been so in.  ...   
6551                                                NaN  ...   
6961  A Delhi-based retired middle class man tries t...  ...   
...                                                 ...  ...   
2                            One Choice Can Destroy You  ...   
630                         All heroes start somewhere.  ...   
629   Mankind was born on Earth. It was never meant ...  ...   
1                                    What a Lovely Day.  ...   
0                                     The park is open.  ...   

                                               overview runtime  \
6181  Two friends, one northern and one southern, st...     561   
9977  Black comedy in which a suicidal doctor strugg...     103   
6080  The bitter fight for supremacy between the thr...      92   
6551                                 No overview found.     105   
6961  Upon retirement, cranky control freak Kamal Ki...     135   
...                                                 ...     ...   
2     Beatrice Prior must confront her inner demons ...     119   
630   Light years from Earth, 26 years after being a...     121   
629   Interstellar chronicles the adventures of a gr...     169   
1     An apocalyptic story set in the furthest reach...     120   
0     Twenty-two years after the events of Jurassic ...     124   

                                         genres  \
6181                      Drama|History|Western   
9977                       Mystery|Comedy|Drama   
6080                                     Comedy   
6551                             Comedy|Mystery   
6961                                     Comedy   
...                                         ...   
2            Adventure|Science Fiction|Thriller   
630            Action|Science Fiction|Adventure   
629             Adventure|Drama|Science Fiction   
1     Action|Adventure|Science Fiction|Thriller   
0     Action|Adventure|Science Fiction|Thriller   

                                   production_companies release_date  \
6181                                                NaN      11/3/85   
9977                                 Simcha Productions     12/14/71   
6080  School Pictures|Parting Shots Media|Logolite E...     10/20/13   
6551               Rhône-Alpes Cinéma|France2 Cinéma      4/13/05   
6961                                UTV Motion Pictures      9/22/06   
...                                                 ...          ...   
2     Summit Entertainment|Mandeville Films|Red Wago...      3/18/15   
630   Marvel Studios|Moving Picture Company (MPC)|Bu...      7/30/14   
629   Paramount Pictures|Legendary Pictures|Warner B...      11/5/14   
1     Village Roadshow Pictures|Kennedy Miller Produ...      5/13/15   
0     Universal Studios|Amblin Entertainment|Legenda...       6/9/15   

     vote_count  vote_average  release_year    budget_adj   revenue_adj  
6181         17           6.0          1985  0.000000e+00  0.000000e+00  
9977         10           6.4          1971  0.000000e+00  0.000000e+00  
6080         82           6.1          2013  0.000000e+00  0.000000e+00  
6551         13           5.7          2005  0.000000e+00  0.000000e+00  
6961         10           6.8          2006  0.000000e+00  0.000000e+00  
...         ...           ...           ...           ...           ...  
2          2480           6.3          2015  1.012000e+08  2.716190e+08  
630        5612           7.9          2014  1.565855e+08  7.122911e+08  
629        6498           8.0          2014  1.519800e+08  5.726906e+08  
1          6185           7.1          2015  1.379999e+08  3.481613e+08  
0          5562           6.5          2015  1.379999e+08  1.392446e+09  

[10866 rows x 21 columns]
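The genres column packs several values into one cell, separated by "|". One possible way to count individual genres is sketched here on three rows copied from the output above; the tmdb-movies.csv file itself is not read, so the counts describe only these three rows.

```python
import pandas as pd

# Three rows copied from the movie output above.
movies = pd.DataFrame({
    "original_title": ["Jurassic World", "Mad Max: Fury Road", "Insurgent"],
    "genres": [
        "Action|Adventure|Science Fiction|Thriller",
        "Action|Adventure|Science Fiction|Thriller",
        "Adventure|Science Fiction|Thriller",
    ],
})

# Split each cell into a list, give every genre its own row, then count.
genre_counts = movies["genres"].str.split("|").explode().value_counts()

print(genre_counts)
```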

2.3 Quiz

from IPython.display import Image, display
from pathlib import Path 
path = Path("images/") #sets path to /_notebook/images
def image_display(image): 
    image['filename'] = path / image['file'] #sets path to image file
    display(Image(filename=image['filename'])) #uses display function from Ipython

stock_image = {'file': "compressionquiz.png"}
image_display(stock_image)

Explanation of each question:

Question 1:

Lossless compression preserves all of the original data, but reduces the file size by finding repeating patterns that can be stored as one unit.

Question 2:

If the file needs to be completely restored as the original, lossless compression must be used. Lossy compression would cause the file to lose data, which would result in a drop in quality or information.

Question 3:

Since lossy compression drops data, it can allow the data to be transmitted faster. Lossless compression keeps every bit of the original, so there is a limit to how far the data can be compressed.
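The lossless idea from the explanations above can be tried with Python's built-in zlib module. This is only a sketch: the repeated byte pattern is made up for illustration, and zlib is one lossless method among many.

```python
import zlib

# Made-up data with an obvious repeating pattern (800 bytes).
original = b"ABABABAB" * 100

compressed = zlib.compress(original)     # lossless compression
restored = zlib.decompress(compressed)   # rebuilds the original bytes

print("original size:  ", len(original))
print("compressed size:", len(compressed))
print("restored matches original:", restored == original)
```

Because the data is so repetitive it shrinks a lot, and the restored copy matches the original exactly, which is what lossless means.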

Hacks

Early Seed award

  • Add this Blog to your own Blogging site.
  • Have all lecture files saved to your files directory before Tech Talk starts. Have data.csv open in vscode. Don't tell anyone. Show to Teacher.

AP Prep

  • Add this Blog to your own Blogging site. In the Blog add notes and observations on each code cell.
  • In blog add College Board practice problems for 2.3.

Over the next 4 weeks, Teachers want you to improve your understanding of data. Look at this blog and others on Unit 2. Your goal is to find some things that differentiate your individual College Board project.

  • Create or find your own dataset. The suggestion is to use a JSON file; integrating with your PBL project would be fabulous. When choosing a data set, think about the following:
    • Does it have a good sample size?
    • Is there bias in the data?
    • Does the data set need to be cleaned?
    • What is the purpose of the data set?
    • ...
  • Continue this Blog using Pandas to extract info from that dataset (ex. max, min, mean, median, mode, etc.)
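As a starting point for the statistics named in the last bullet, here is a small sketch on a single column. The first four values come from the calories example earlier in this lesson; the extra 380 is made up so that the mode is meaningful.

```python
import pandas as pd

# Calories values from the earlier example, plus one made-up repeat of 380.
calories = pd.Series([420, 380, 390, 460, 380], name="calories")

print("max:   ", calories.max())
print("min:   ", calories.min())
print("mean:  ", calories.mean())
print("median:", calories.median())
print("mode:  ", calories.mode().tolist())  # mode() returns a Series because ties are possible
```

The same methods work on a whole DataFrame, where they return one value per numeric column.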

Hack Helpers

Here is how Mort started on this assignment by asking ChatGPT ... Regarding Python Pandas, what are some data sets that would be good for learning Pandas?

  • There are many data sets that are suitable for learning pandas, depending on your interests and the skills you want to develop. Here are some suggestions:

    • Titanic Dataset: This is a classic dataset for data analysis and machine learning, and is often used as an introduction to pandas. The dataset contains information about passengers on the Titanic, including their demographics, ticket class, and survival status.

    • Iris Dataset: This is another classic dataset that is often used in machine learning and data analysis courses. The dataset contains information about iris flowers, including their sepal length, sepal width, petal length, and petal width.

    • NBA Player Stats: This dataset contains information about NBA players, including their stats for various games, their positions, and their salaries. It is a great dataset for learning how to clean and manipulate data using pandas.

    • Housing Prices Dataset: This dataset contains information about housing prices in a particular city or region, including variables such as the number of bedrooms, square footage, and price. It is a great dataset for learning how to manipulate and analyze numerical data using pandas.

    • COVID-19 Dataset: This dataset contains information about the COVID-19 pandemic, including the number of cases, deaths, and recoveries in various countries and regions. It is a great dataset for learning how to manipulate and analyze time-series data using pandas.

    • World Bank Dataset: This dataset contains information about economic indicators from various countries, such as GDP, population, and poverty rates. It is a great dataset for learning how to manipulate and analyze large datasets using pandas.

  • These are just a few examples of the many datasets that are suitable for learning pandas. The best dataset for you will depend on your interests and the specific skills you want to develop.

    Follow up question, "where can I find Titanic data set?"

  • The Titanic dataset is a popular dataset for data analysis and machine learning, and it can be found on various websites and data repositories. Here are a few places where you can find the Titanic dataset:

    • Kaggle: The Titanic dataset is available on Kaggle, which is a popular platform for data scientists and machine learning enthusiasts. You can download the dataset from the Kaggle website after creating an account.

    • UCI Machine Learning Repository: The Titanic dataset is also available on the UCI Machine Learning Repository, which is a collection of datasets that are commonly used for machine learning research. You can download the dataset from the UCI Machine Learning Repository website.

    • Seaborn library: If you have the Seaborn library installed in your Python environment, you can load the Titanic dataset directly from the library using the following code:

      import seaborn as sns
      titanic_data = sns.load_dataset('titanic')
      

Titanic Data

Look at a sample of data.

import seaborn as sns

# Load the titanic dataset
titanic_data = sns.load_dataset('titanic')

print("Titanic Data")


print(titanic_data.columns) # titanic data set

print(titanic_data[['survived','pclass', 'sex', 'age', 'sibsp', 'parch', 'class', 'fare', 'embark_town']]) # look at selected columns
Titanic Data
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')
     survived  pclass     sex   age  sibsp  parch   class     fare  \
0           0       3    male  22.0      1      0   Third   7.2500   
1           1       1  female  38.0      1      0   First  71.2833   
2           1       3  female  26.0      0      0   Third   7.9250   
3           1       1  female  35.0      1      0   First  53.1000   
4           0       3    male  35.0      0      0   Third   8.0500   
..        ...     ...     ...   ...    ...    ...     ...      ...   
886         0       2    male  27.0      0      0  Second  13.0000   
887         1       1  female  19.0      0      0   First  30.0000   
888         0       3  female   NaN      1      2   Third  23.4500   
889         1       1    male  26.0      0      0   First  30.0000   
890         0       3    male  32.0      0      0   Third   7.7500   

     embark_town  
0    Southampton  
1      Cherbourg  
2    Southampton  
3    Southampton  
4    Southampton  
..           ...  
886  Southampton  
887  Southampton  
888  Southampton  
889    Cherbourg  
890   Queenstown  

[891 rows x 9 columns]

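Use Pandas to clean the data. Most analysis, whether Machine Learning or Pandas in general, works best when the data is in a standardized format. Getting it there is called 'Cleaning' or 'Preprocessing' the data; 'Training' is the later step of fitting a model to the cleaned data.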

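# Preprocess the data
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


td = titanic_data  # td is a second name for the same DataFrame, not a copy
# drop columns that are not used here, then drop rows with missing values
td.drop(['alive', 'who', 'adult_male', 'class', 'embark_town', 'deck'], axis=1, inplace=True)
td.dropna(inplace=True)
# convert text and boolean columns to 1/0
td['sex'] = td['sex'].apply(lambda x: 1 if x == 'male' else 0)
td['alone'] = td['alone'].apply(lambda x: 1 if x == True else 0)

# Encode categorical variables
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(td[['embarked']])
onehot = enc.transform(td[['embarked']]).toarray()
cols = ['embarked_' + val for val in enc.categories_[0]]
# pd.DataFrame(onehot) is labeled 0..n-1 while td keeps its original row labels,
# so this assignment matches rows by label and leaves NaN where labels do not match;
# the dropna() below then removes those rows (passing index=td.index would keep them)
td[cols] = pd.DataFrame(onehot)
td.drop(['embarked'], axis=1, inplace=True)
td.dropna(inplace=True)

print(td)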
     survived  pclass  sex   age  sibsp  parch      fare  alone  embarked_C  \
0           0       3    1  22.0      1      0    7.2500      0         0.0   
1           1       1    0  38.0      1      0   71.2833      0         1.0   
2           1       3    0  26.0      0      0    7.9250      1         0.0   
3           1       1    0  35.0      1      0   53.1000      0         0.0   
4           0       3    1  35.0      0      0    8.0500      1         0.0   
..        ...     ...  ...   ...    ...    ...       ...    ...         ...   
705         0       2    1  39.0      0      0   26.0000      1         0.0   
706         1       2    0  45.0      0      0   13.5000      1         0.0   
707         1       1    1  42.0      0      0   26.2875      1         0.0   
708         1       1    0  22.0      0      0  151.5500      1         0.0   
710         1       1    0  24.0      0      0   49.5042      1         1.0   

     embarked_Q  embarked_S  
0           0.0         1.0  
1           0.0         0.0  
2           0.0         1.0  
3           0.0         1.0  
4           0.0         1.0  
..          ...         ...  
705         0.0         1.0  
706         0.0         1.0  
707         1.0         0.0  
708         0.0         1.0  
710         0.0         0.0  

[564 rows x 11 columns]
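Pandas matches rows by label when a column is assigned, which may explain part of the drop to 564 rows above. The sketch here shows that behavior on made-up rows, and then shows pandas' own get_dummies as one alternative to OneHotEncoder that keeps the original row labels. This is a swapped-in technique for illustration, not the method used in the cell above.

```python
import pandas as pd

# Made-up rows; the gaps in the index mimic what dropna() leaves behind.
sample = pd.DataFrame(
    {"age": [22.0, 38.0, 26.0, 35.0], "embarked": ["S", "C", "Q", "S"]},
    index=[0, 1, 5, 9],
)

# Assigning a Series labeled 0..3 aligns on labels: rows 5 and 9 get NaN.
misaligned = sample.copy()
misaligned["flag"] = pd.Series([1, 2, 3, 4])
print(misaligned)

# get_dummies builds the 0/1 columns directly from the frame, keeping its labels.
encoded = pd.get_dummies(sample, columns=["embarked"], dtype=float)
print(encoded)
```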

The result of 'Cleaning' data is that it becomes easier to analyze and draw conclusions from. In looking at the Titanic, once the data is clean you would probably want to estimate the likely chance of survival.

This would involve analyzing various factors (such as age, gender, class, etc.) that may have affected a person's chances of survival, and using that information to make predictions about whether an individual would have survived or not.

  • Data description:
    • Survival - Survival (0 = No; 1 = Yes). Not included in test.csv file.
    • Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
    • Name - Name
    • Sex - Sex
    • Age - Age
    • Sibsp - Number of Siblings/Spouses Aboard
    • Parch - Number of Parents/Children Aboard
    • Ticket - Ticket Number
    • Fare - Passenger Fare
    • Cabin - Cabin
    • Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • Perished Mean/Average

print(td.query("survived == 0").mean())
survived       0.000000
pclass         2.464072
sex            0.844311
age           31.073353
sibsp          0.562874
parch          0.398204
fare          24.835902
alone          0.616766
embarked_C     0.185629
embarked_Q     0.038922
embarked_S     0.775449
dtype: float64
  • Survived Mean/Average
print(td.query("survived == 1").mean())
survived       1.000000
pclass         1.878261
sex            0.326087
age           28.481522
sibsp          0.504348
parch          0.508696
fare          50.188806
alone          0.456522
embarked_C     0.152174
embarked_Q     0.034783
embarked_S     0.813043
dtype: float64
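The two query calls above can also be written as a single groupby, which puts the averages for both groups side by side. A minimal sketch, assuming only the first five passengers copied from the Titanic sample output earlier (so the averages here describe those five rows, not the full data set):

```python
import pandas as pd

# First five passengers copied from the Titanic sample output above.
sample = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# One row of averages per value of "survived".
means = sample.groupby("survived").mean()

print(means)
```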

Survived Max and Min Stats

print(td.query("survived == 1").max())
print(td.query("survived == 1").min())
survived        1.0000
pclass          3.0000
sex             1.0000
age            80.0000
sibsp           4.0000
parch           5.0000
fare          512.3292
alone           1.0000
embarked_C      1.0000
embarked_Q      1.0000
embarked_S      1.0000
dtype: float64
survived      1.00
pclass        1.00
sex           0.00
age           0.75
sibsp         0.00
parch         0.00
fare          0.00
alone         0.00
embarked_C    0.00
embarked_Q    0.00
embarked_S    0.00
dtype: float64

Machine Learning

From Tutorials Point: Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction via a consistent interface in Python.

Description from ChatGPT: The Titanic dataset is a popular dataset for data analysis and machine learning. In the context of machine learning, accuracy refers to the percentage of correctly classified instances in a set of predictions. In this case, the testing data is a subset of the original Titanic dataset that the decision tree model has not seen during training...

After training the decision tree model on the training data, we can evaluate its performance on the testing data by making predictions on the testing data and comparing them to the actual outcomes. The accuracy of the decision tree classifier on the testing data tells us how well the model generalizes to new data that it hasn't seen before...

For example, if the accuracy of the decision tree classifier on the testing data is 0.8 (or 80%), this means that 80% of the predictions made by the model on the testing data were correct...

Predicting the chance of survival could be done using various machine learning techniques, including decision trees, logistic regression, or support vector machines, among others.
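Accuracy as described above can be worked out by hand. In this sketch both lists are made up for illustration: y_true holds the actual outcomes and y_pred holds what a model guessed.

```python
# Made-up outcomes (1 = survived, 0 = perished) and made-up predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]

# Count the positions where the prediction matches the actual outcome.
correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)

# Accuracy is the share of predictions that were correct.
accuracy = correct / len(y_true)

print("correct predictions:", correct, "of", len(y_true))
print("accuracy:", accuracy)
```

Eight of the ten guesses match, so the accuracy is 0.8, the same kind of number that accuracy_score reports.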

  • The code below splits the cleaned data into training and testing sets, trains two models, and reports the accuracy of each. IMO, the next step would be to insert a new passenger and predict survival. The same approach could be applied to other predictions, such as whether a player will hit a Home Run or whether a Stock will go up or down.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split arrays or matrices into random train and test subsets.
X = td.drop('survived', axis=1)
y = td['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a decision tree classifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Test the model
y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('DecisionTreeClassifier Accuracy:', accuracy)

# Train a logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Test the model
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('LogisticRegression Accuracy:', accuracy)
DecisionTreeClassifier Accuracy: 0.7529411764705882
LogisticRegression Accuracy: 0.788235294117647
/Users/ekamjotkaire/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
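The warning above suggests raising max_iter or scaling the data, and the lesson suggests predicting survival for a new passenger. Both ideas are sketched here on made-up toy data, not the Titanic data. The toy rows, the new passenger, and the choice of StandardScaler with max_iter=1000 are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data (NOT the Titanic data): sex is 1 for male, 0 for female.
X_toy = pd.DataFrame({
    "sex":    [1, 1, 1, 1, 0, 0, 0, 0],
    "pclass": [1, 2, 3, 3, 1, 2, 3, 1],
})
y_toy = pd.Series([0, 0, 0, 0, 1, 1, 1, 1], name="survived")

# Decision tree, as in the lesson; random_state makes the result repeatable.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_toy, y_toy)

# Logistic regression with scaled inputs and a higher iteration limit,
# the two remedies named in the warning text.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_toy, y_toy)

# A hypothetical new passenger: female, second class.
new_passenger = pd.DataFrame({"sex": [0], "pclass": [2]})

tree_prediction = int(tree.predict(new_passenger)[0])
logreg_prediction = int(scaled_logreg.predict(new_passenger)[0])

print("DecisionTreeClassifier predicts survived =", tree_prediction)
print("LogisticRegression predicts survived =", logreg_prediction)
```

With the real data, the new passenger would need the same columns, in the same order, as the X used for training.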