How to Read Hours From Date From Csv Python
In [1]: import pandas as pd

In [2]: import matplotlib.pyplot as plt
Air quality data

For this tutorial, air quality data about \(NO_2\) and Particulate matter less than 2.5 micrometers is used, made available by OpenAQ and downloaded using the py-openaq package. The air_quality_no2_long.csv data set provides \(NO_2\) values for the measurement stations FR04014, BETR801 and London Westminster in respectively Paris, Antwerp and London.
In [3]: air_quality = pd.read_csv("data/air_quality_no2_long.csv")

In [4]: air_quality = air_quality.rename(columns={"date.utc": "datetime"})

In [5]: air_quality.head()
Out[5]: 
    city country                  datetime location parameter  value   unit
0  Paris      FR 2019-06-21 00:00:00+00:00  FR04014       no2   20.0  µg/m³
1  Paris      FR 2019-06-20 23:00:00+00:00  FR04014       no2   21.8  µg/m³
2  Paris      FR 2019-06-20 22:00:00+00:00  FR04014       no2   26.5  µg/m³
3  Paris      FR 2019-06-20 21:00:00+00:00  FR04014       no2   24.9  µg/m³
4  Paris      FR 2019-06-20 20:00:00+00:00  FR04014       no2   21.4  µg/m³
In [6]: air_quality.city.unique()
Out[6]: array(['Paris', 'Antwerpen', 'London'], dtype=object)
How to handle time series data with ease?¶
Using pandas datetime properties¶
- I want to work with the dates in the column datetime as datetime objects instead of plain text

In [7]: air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])

In [8]: air_quality["datetime"]
Out[8]: 
0      2019-06-21 00:00:00+00:00
1      2019-06-20 23:00:00+00:00
2      2019-06-20 22:00:00+00:00
3      2019-06-20 21:00:00+00:00
4      2019-06-20 20:00:00+00:00
                  ...           
2063   2019-05-07 06:00:00+00:00
2064   2019-05-07 04:00:00+00:00
2065   2019-05-07 03:00:00+00:00
2066   2019-05-07 02:00:00+00:00
2067   2019-05-07 01:00:00+00:00
Name: datetime, Length: 2068, dtype: datetime64[ns, UTC]
Initially, the values in datetime are character strings and do not provide any datetime operations (e.g. extract the year, day of the week, …). By applying the to_datetime function, pandas interprets the strings and converts these to datetime (i.e. datetime64[ns, UTC]) objects. In pandas we call these datetime objects, similar to datetime.datetime from the standard library, pandas.Timestamp.
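As a minimal standalone sketch of this conversion (the example strings below are made up for illustration and are not from the tutorial's data set), to_datetime accepts lists or Series of strings and, optionally, an explicit format:

# Illustrative only: hand-written date strings, not the air quality data.
import pandas as pd

raw = pd.Series(["2019-06-21 00:00:00+00:00", "2019-06-20 23:00:00+00:00"])
parsed = pd.to_datetime(raw, utc=True)   # tz-aware datetime64[ns, UTC] values
print(parsed)

# An explicit format string can be given for non-standard representations.
other = pd.to_datetime(["21-06-2019", "20-06-2019"], format="%d-%m-%Y")
print(other)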
Note
As many data sets do contain datetime information in one of the columns, pandas input functions like pandas.read_csv() and pandas.read_json() can do the transformation to dates when reading the data, using the parse_dates parameter with a list of the columns to read as Timestamp:

pd.read_csv("../data/air_quality_no2_long.csv", parse_dates=["datetime"])
Why are these pandas.Timestamp objects useful? Let's illustrate the added value with some example cases.
What is the start and end date of the time series data set we are working with?
In [9]: air_quality["datetime"].min(), air_quality["datetime"].max()
Out[9]: 
(Timestamp('2019-05-07 01:00:00+0000', tz='UTC'),
 Timestamp('2019-06-21 00:00:00+0000', tz='UTC'))
Using pandas.Timestamp for datetimes enables us to calculate with date information and make them comparable. Hence, we can use this to get the length of our time series:
In [10]: air_quality["datetime"].max() - air_quality["datetime"].min()
Out[10]: Timedelta('44 days 23:00:00')
The result is a pandas.Timedelta object, similar to datetime.timedelta from the standard Python library and defining a time duration.
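To make the arithmetic concrete, here is a small hedged sketch with hand-picked values (chosen to mirror the output above rather than computed from the data set):

import pandas as pd

start = pd.Timestamp("2019-05-07 01:00:00", tz="UTC")
duration = pd.Timedelta("44 days 23:00:00")

end = start + duration                      # Timestamp + Timedelta -> Timestamp
print(end)                                  # 2019-06-21 00:00:00+00:00
print(duration / pd.Timedelta("1 hour"))    # duration expressed in hours: 1079.0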
To user guide
The various time concepts supported by pandas are explained in the user guide section on time related concepts.
- I want to add a new column to the DataFrame containing only the month of the measurement

In [11]: air_quality["month"] = air_quality["datetime"].dt.month

In [12]: air_quality.head()
Out[12]: 
    city country                  datetime location parameter  value   unit  month
0  Paris      FR 2019-06-21 00:00:00+00:00  FR04014       no2   20.0  µg/m³      6
1  Paris      FR 2019-06-20 23:00:00+00:00  FR04014       no2   21.8  µg/m³      6
2  Paris      FR 2019-06-20 22:00:00+00:00  FR04014       no2   26.5  µg/m³      6
3  Paris      FR 2019-06-20 21:00:00+00:00  FR04014       no2   24.9  µg/m³      6
4  Paris      FR 2019-06-20 20:00:00+00:00  FR04014       no2   21.4  µg/m³      6
By using Timestamp objects for dates, a lot of time-related properties are provided by pandas. For example the month, but also year, weekofyear, quarter, … All of these properties are accessible by the dt accessor.
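A brief sketch of a few of these dt properties applied to the tutorial's datetime column (the output values depend on the data, so the comments are only indicative):

# Assumes air_quality["datetime"] was converted with pd.to_datetime as above.
air_quality["datetime"].dt.year.head()        # e.g. 2019
air_quality["datetime"].dt.quarter.head()     # quarter of the year, 1-4
air_quality["datetime"].dt.day_name().head()  # 'Friday', 'Thursday', ...
air_quality["datetime"].dt.hour.head()        # hour of the day, 0-23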
- What is the average \(NO_2\) concentration for each day of the week for each of the measurement locations?

In [13]: air_quality.groupby(
   ....:     [air_quality["datetime"].dt.weekday, "location"])["value"].mean()
   ....: 
Out[13]: 
datetime  location          
0         BETR801               27.875000
          FR04014               24.856250
          London Westminster    23.969697
1         BETR801               22.214286
          FR04014               30.999359
                                  ...    
5         FR04014               25.266154
          London Westminster    24.977612
6         BETR801               21.896552
          FR04014               23.274306
          London Westminster    24.859155
Name: value, Length: 21, dtype: float64
Remember the split-apply-combine pattern provided by groupby from the tutorial on statistics calculation? Here, we want to calculate a given statistic (e.g. mean \(NO_2\)) for each weekday and for each measurement location. To group on weekdays, we use the datetime property weekday (with Monday=0 and Sunday=6) of pandas Timestamp, which is also accessible by the dt accessor. The grouping on both locations and weekdays can be done to split the calculation of the mean on each of these combinations.

Danger

As we are working with a very short time series in these examples, the analysis does not provide a long-term representative result!
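As an optional follow-up (not part of the original tutorial), the same weekday/location means can be reshaped into a small table with unstack, which makes it easier to compare the stations side by side:

# Group by weekday and location, then move the location level to the columns.
weekday_means = (
    air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"]
    .mean()
    .unstack("location")
)
weekday_means  # rows: weekday 0-6 (Monday=0), one column per station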
- Plot the typical \(NO_2\) pattern during the day of our time series of all stations together. In other words, what is the average value for each hour of the day?

In [14]: fig, axs = plt.subplots(figsize=(12, 4))

In [15]: air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
   ....:     kind='bar', rot=0, ax=axs
   ....: )
   ....: 
Out[15]: <AxesSubplot:xlabel='datetime'>

In [16]: plt.xlabel("Hour of the day");  # custom x label using matplotlib

In [17]: plt.ylabel("$NO_2 (µg/m^3)$");
Similar to the previous case, we want to calculate a given statistic (e.g. mean \(NO_2\)) for each hour of the day and we can use the split-apply-combine approach again. For this case, we use the datetime property hour of pandas Timestamp, which is also accessible by the dt accessor.
Datetime as index¶
In the tutorial on reshaping, pivot() was introduced to reshape the data table with each of the measurement locations as a separate column:
In [18]: no_2 = air_quality.pivot(index="datetime", columns="location", values="value")

In [19]: no_2.head()
Out[19]: 
location                   BETR801  FR04014  London Westminster
datetime                                                       
2019-05-07 01:00:00+00:00     50.5     25.0                23.0
2019-05-07 02:00:00+00:00     45.0     27.7                19.0
2019-05-07 03:00:00+00:00      NaN     50.4                19.0
2019-05-07 04:00:00+00:00      NaN     61.9                16.0
2019-05-07 05:00:00+00:00      NaN     72.4                 NaN
Note
By pivoting the data, the datetime information became the index of the table. In general, setting a column as an index can be achieved by the set_index function.
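For example, a minimal sketch of the set_index route, shown here only as an alternative to the pivot used above:

# Use the datetime column of the long-format table as the row index.
aq_indexed = air_quality.set_index("datetime")
aq_indexed.index  # a DatetimeIndex, so .year, .hour, ... are available on it directly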
Working with a datetime index (i.e. DatetimeIndex) provides powerful functionalities. For example, we do not need the dt accessor to get the time series properties, but have these properties available on the index directly:
In [20]: no_2.index.year, no_2.index.weekday
Out[20]: 
(Int64Index([2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019,
             ...
             2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019],
            dtype='int64', name='datetime', length=1033),
 Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             ...
             3, 3, 3, 3, 3, 3, 3, 3, 3, 4],
            dtype='int64', name='datetime', length=1033))
Some other advantages are the convenient subsetting of time periods or the adapted time scale on plots. Let's apply this on our data.
- Create a plot of the \(NO_2\) values in the different stations from the 20th of May till the end of the 21st of May

In [21]: no_2["2019-05-20":"2019-05-21"].plot();
By providing a string that parses to a datetime, a specific subset of the data can be selected on a DatetimeIndex.
To user guide
More information on the DatetimeIndex and the slicing by using strings is provided in the user guide section on time series indexing.
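A few hedged variations on this string-based selection (the exact rows returned depend on the data, so treat the comments as indicative only):

# Partial string indexing on the DatetimeIndex, using .loc for row selection.
no_2.loc["2019-05"]                                # all rows in May 2019
no_2.loc["2019-05-20"]                             # all rows on the 20th of May
no_2.loc["2019-05-20 12:00":"2019-05-21 12:00"]    # an arbitrary 24-hour window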
Resample a time series to another frequency¶
- Aggregate the current hourly time series values to the monthly maximum value in each of the stations.

In [22]: monthly_max = no_2.resample("M").max()

In [23]: monthly_max
Out[23]: 
location                   BETR801  FR04014  London Westminster
datetime                                                       
2019-05-31 00:00:00+00:00     74.5     97.0                97.0
2019-06-30 00:00:00+00:00     52.5     84.7                52.0
A very powerful method on time series data with a datetime index is the ability to resample() time series to another frequency (e.g., converting secondly data into 5-minutely data).

The resample() method is similar to a groupby operation:
- it provides a time-based grouping, by using a string (e.g. M, 5H, …) that defines the target frequency

- it requires an aggregation function such as mean, max, …
When defined, the frequency of the time series is provided by the freq attribute:
In [24]: monthly_max.index.freq
Out[24]: <MonthEnd>
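As a hedged sketch of the same idea with other frequencies and aggregations (the frequency strings "D" and "W" are standard pandas offset aliases; the results depend on the data):

# Resample the wide table to other frequencies with different aggregations.
daily_mean = no_2.resample("D").mean()
weekly_median = no_2.resample("W").median()

# Several aggregations at once via .agg
daily_stats = no_2.resample("D").agg(["min", "max", "mean"])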
- Make a plot of the daily mean \(NO_2\) value in each of the stations.

In [25]: no_2.resample("D").mean().plot(style="-o", figsize=(10, 5));
To user guide
More details on the power of time series resampling are provided in the user guide section on resampling.
Remember
- Valid date strings can be converted to datetime objects using the to_datetime function or as part of read functions.

- Datetime objects in pandas support calculations, logical operations and convenient date-related properties using the dt accessor.

- A DatetimeIndex contains these date-related properties and supports convenient slicing.

- Resample is a powerful method to change the frequency of a time series.
Source: https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html