<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Pandas |</title><link>https://yu-cheng.co/tags/pandas/</link><atom:link href="https://yu-cheng.co/tags/pandas/index.xml" rel="self" type="application/rss+xml"/><description>Pandas</description><generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Thu, 16 Aug 2018 15:26:08 +0000</lastBuildDate><image><url>https://yu-cheng.co/media/icon_hu_87a968e0c4fc153c.png</url><title>Pandas</title><link>https://yu-cheng.co/tags/pandas/</link></image><item><title>Parsing a .xlsx file generated by Labchart</title><link>https://yu-cheng.co/blog/labchart/</link><pubDate>Thu, 16 Aug 2018 15:26:08 +0000</pubDate><guid>https://yu-cheng.co/blog/labchart/</guid><description>&lt;p&gt;My wife often complains about manually repeating a lot of steps for her research. With a little Python skill, I wanted to help, so I asked her what she&amp;rsquo;s been working on recently. One of her experiments measures the changes in pressure and volume of a mouse&amp;rsquo;s heart, that is, the pressure-volume (PV) loop of heart contraction and relaxation.&lt;/p&gt;
&lt;p&gt;The data are collected through &lt;em&gt;LabChart Pro&lt;/em&gt;, which can export to several formats, including &lt;code&gt;.txt&lt;/code&gt;, &lt;code&gt;.xlsx&lt;/code&gt; and &lt;code&gt;.mat&lt;/code&gt;. Throughout her career, she and her colleagues have mostly used Excel to process and plot the data. However, she doesn&amp;rsquo;t know any fancy Excel tricks, and has often handled the data manually. The more mice used in the experiments, the more times she needs to repeat the same steps. And without remembering all the processing steps, she can&amp;rsquo;t be sure that all the data processing is done consistently. This leads to another hot topic in all fields of science in recent years: &lt;strong&gt;the lack of reproducibility.&lt;/strong&gt;&lt;/p&gt;
&lt;h4 id="step-1-data-inspection"&gt;Step 1: Data inspection&lt;/h4&gt;
&lt;p&gt;I first looked into the &lt;code&gt;.xlsx&lt;/code&gt; file she gave me. The original measurements are stored in four columns: &lt;em&gt;time&lt;/em&gt;, &lt;em&gt;pressure&lt;/em&gt;, &lt;em&gt;volume&lt;/em&gt; and &lt;em&gt;loop&lt;/em&gt;, preceded by many header lines summarizing the experiment setup. I decided to skip the header lines for now. One file holds the measurements of one mouse over a specified period. In this specific case, there are twelve PV loops, whose starting time-steps are marked in the &lt;em&gt;loop&lt;/em&gt; column.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;import pandas as pd
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;import matplotlib.pyplot as plt
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# Load in data using pandas:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cols = [&amp;#39;Time&amp;#39;,&amp;#39;LV Pressure&amp;#39;,&amp;#39;LV Volume&amp;#39;, &amp;#39;Loop&amp;#39;]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;df= pd.read_excel(&amp;#39;336-ivc2.xlsx&amp;#39;,names=cols, usecols = &amp;#34;A:D&amp;#34;,skiprows=140)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="step-2-data-cleaning"&gt;Step 2: Data cleaning&lt;/h4&gt;
&lt;p&gt;One thing I noticed is that the &lt;em&gt;loop&lt;/em&gt; column only marks the starting time-step of each loop. I want all rows from the same loop to carry the same flag. I first filled the missing rows with the last available value, then extracted the loop number and converted the column to numeric.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;df=df.ffill() # Forward-fill the missing values
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;temp=df[&amp;#39;Loop&amp;#39;].apply(lambda x:x[12:14])
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;df[&amp;#39;Loop&amp;#39;]=pd.to_numeric(temp.str.replace(&amp;#39;;&amp;#39;, &amp;#39;&amp;#39;))
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id="step-3-plottinggroupbyprocessing"&gt;Step 3: Plotting/Groupby/Processing&lt;/h4&gt;
&lt;p&gt;Now I can easily select and visualize each loop&amp;rsquo;s PV changes.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;df[df[&amp;#39;Loop&amp;#39;]==1][&amp;#39;LV Volume&amp;#39;].plot()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The goal is to visualize the mean of all 12 PV loops. To do that, I first created a new &lt;em&gt;Step&lt;/em&gt; column to index the steps in each loop.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;def restep(series):
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; length=len(series)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; new_steps=[x for x in range(1,length+1)]
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; return new_steps
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;for loop in range(1,13): # Must inspect the .xlsx first
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; df.loc[df[&amp;#39;Loop&amp;#39;]==loop,&amp;#39;Step&amp;#39;]=restep(df.loc[df[&amp;#39;Loop&amp;#39;]==loop,&amp;#39;Loop&amp;#39;]) # &amp;#39;Step&amp;#39; does not exist yet, so take the loop length from &amp;#39;Loop&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;I created another column, &lt;em&gt;PdV&lt;/em&gt;, defined as $\frac{P}{V-V_{0}}$. This is the variable we want.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;df[&amp;#39;PdV&amp;#39;]=df[&amp;#39;LV Pressure&amp;#39;]/(df[&amp;#39;LV Volume&amp;#39;]-3.902683)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The twelve &lt;em&gt;PdV&lt;/em&gt; loops look like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;df.PdV.plot()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;figure&gt;&lt;img src="https://yu-cheng.co/img/pvloop_loops.png"&gt;
&lt;/figure&gt;
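&lt;p&gt;As an aside, the &lt;em&gt;Step&lt;/em&gt; index can probably also be built without an explicit loop. The following is only a minimal sketch on a made-up two-loop table (the &lt;code&gt;toy&lt;/code&gt; frame and its values are assumptions, not the LabChart data), using pandas&amp;rsquo; &lt;code&gt;groupby(...).cumcount()&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for the cleaned table: two loops of different lengths.
toy = pd.DataFrame({
    'Loop': [1, 1, 1, 2, 2],
    'LV Volume': [10.0, 12.0, 11.0, 9.5, 12.5],
})

# cumcount() numbers the rows inside each group starting from 0,
# so adding 1 gives the same 1-based steps that restep() produces.
toy['Step'] = toy.groupby('Loop').cumcount() + 1

print(toy['Step'].tolist())  # [1, 2, 3, 1, 2]
```

&lt;p&gt;Written this way, the number of loops does not need to be hard-coded after inspecting the file.&lt;/p&gt;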
&lt;p&gt;The &lt;em&gt;Step&lt;/em&gt; column comes in handy to plot the mean of the twelve loops.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;import seaborn as sns
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;sns.set_style(&amp;#34;white&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;# Group by Step and take the mean across loops
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;gbmean = df.groupby(&amp;#39;Step&amp;#39;)[&amp;#39;PdV&amp;#39;].mean()
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;ax=gbmean.plot()
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;plt.title(&amp;#39;mean PdV of 12 loops&amp;#39;)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;plt.xlabel(&amp;#39;Step&amp;#39;)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;plt.ylabel(&amp;#39;PdV&amp;#39;)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;sns.despine()
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;plt.show()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;figure&gt;&lt;img src="https://yu-cheng.co/img/pvloop_mean.png"&gt;
&lt;/figure&gt;
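&lt;p&gt;To see the cleaning and averaging steps above run end to end, here is a small sketch on a toy table. The helper name &lt;code&gt;mean_by_step&lt;/code&gt;, the marker text &lt;code&gt;'Loop 1;'&lt;/code&gt; and all values are hypothetical; the real LabChart cells may be formatted differently, which is why the sketch pulls out the first run of digits instead of slicing fixed positions.&lt;/p&gt;

```python
import pandas as pd

def mean_by_step(df, value_col, loop_col='Loop'):
    """Average value_col across loops at each within-loop step."""
    out = df.copy()
    # Carry each loop marker down to the rows that follow it.
    out[loop_col] = out[loop_col].ffill()
    # Pull the first run of digits out of the marker text (assumed format).
    out[loop_col] = pd.to_numeric(out[loop_col].str.extract(r'(\d+)')[0])
    # Number the rows inside each loop from 1.
    out['Step'] = out.groupby(loop_col).cumcount() + 1
    return out.groupby('Step')[value_col].mean()

# Hypothetical marker text and values, for illustration only.
toy = pd.DataFrame({
    'Loop': ['Loop 1;', None, 'Loop 2;', None],
    'PdV': [1.0, 3.0, 2.0, 5.0],
})
means = mean_by_step(toy, 'PdV')
print(means.tolist())  # [1.5, 4.0]
```

&lt;p&gt;Each step&amp;rsquo;s mean is taken over whichever loops reach that step, so loops of unequal length are averaged over fewer samples near the end.&lt;/p&gt;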
&lt;p&gt;Pandas is powerful, but I am pretty sure that some Excel gurus can do the same thing quickly. It&amp;rsquo;s fun to work on data that I have zero knowledge of. I&amp;rsquo;ll try to rewrite some of the processing steps into functions that my wife can easily apply to new &lt;code&gt;.xlsx&lt;/code&gt; files from the same experiment.&lt;/p&gt;</description></item><item><title>Fill in the missing data using Python pandas</title><link>https://yu-cheng.co/blog/pandas_missing_value/</link><pubDate>Mon, 13 Feb 2017 00:42:30 +0000</pubDate><guid>https://yu-cheng.co/blog/pandas_missing_value/</guid><description>&lt;p&gt;One of the many advantages of Python is its abundant and often powerful libraries. For my research, besides plotting maps, I often play with time series. When it comes to manipulating and plotting time series, I have found no other tool that beats Python&amp;rsquo;s pandas.&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At the core of pandas are its data structures: &lt;em&gt;Series&lt;/em&gt;, &lt;em&gt;DataFrame&lt;/em&gt; and &lt;em&gt;Panel&lt;/em&gt; (&lt;em&gt;Panel&lt;/em&gt; has since been removed from pandas). The ones I use the most are the first two. A &lt;em&gt;Series&lt;/em&gt; is a one-dimensional labeled array, which in my case is indexed by timestamps, and a &lt;em&gt;DataFrame&lt;/em&gt; is a table whose columns are &lt;em&gt;Series&lt;/em&gt; sharing one index. In a real-world use case, I use pandas to generate a time axis, which is then attached to my Agulhas leakage time series. After that, the value at a specific timestep can be retrieved with &lt;code&gt;Series['timestamp']&lt;/code&gt;, and plotting the whole time series is as simple as &lt;code&gt;Series.plot()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To see the key statistics of a &lt;em&gt;DataFrame&lt;/em&gt; with many columns, simply call &lt;code&gt;DataFrame.describe()&lt;/code&gt;. It returns a table with the count, mean, standard deviation, minimum, maximum and percentiles of each column. To compare multiple time series visually, just call &lt;code&gt;DataFrame.plot()&lt;/code&gt;.&lt;/p&gt;
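&lt;p&gt;A minimal sketch of the workflow just described, assuming a made-up monthly series (the name &lt;code&gt;leakage&lt;/code&gt;, the dates and the numbers are placeholders, not my Agulhas leakage data):&lt;/p&gt;

```python
import pandas as pd

# Generate a monthly time axis and attach it to some values.
time_axis = pd.date_range('2000-01-01', periods=4, freq='MS')
leakage = pd.Series([14.2, 15.1, 13.8, 16.0], index=time_axis, name='leakage')

# Retrieve the value at one timestep by its label.
feb = leakage.loc['2000-02-01']

# Put two series side by side and summarize every column at once.
frame = pd.DataFrame({'raw': leakage, 'doubled': leakage * 2})
summary = frame.describe()

print(feb)                    # 15.1
print(list(summary.index))    # count, mean, std, min, quartiles, max
```

&lt;p&gt;Calling &lt;code&gt;leakage.plot()&lt;/code&gt; or &lt;code&gt;frame.plot()&lt;/code&gt; on the same objects would draw them against the time axis, provided matplotlib is installed.&lt;/p&gt;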
&lt;h4 id="working-with-missing-data"&gt;Working with missing data&lt;/h4&gt;
&lt;p&gt;Recently, I have been calculating the Atlantic Ocean Heat Content (OHC).&lt;/p&gt;
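&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;#headers=[&amp;#39;date&amp;#39;,&amp;#39;OHC2000&amp;#39;,&amp;#39;OHC300&amp;#39;,&amp;#39;OHC700&amp;#39;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# DataFrame.from_csv was removed in pandas 1.0; read_csv with these arguments matches its defaults.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;OHC_multilevels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;OHC_HRC07_1951-2002.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# If it&amp;#39;s pandas generated, this is much easier.&lt;/span&gt;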
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;OHC_multilevels&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;figure&gt;&lt;img src="https://yu-cheng.co/img/output_32_1.png" width="600"&gt;&lt;figcaption&gt;
&lt;h4&gt;Atlantic OHC in multiple layers 1951-2002&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Obviously, something fishy happened near 1952 and again in 1971. Several months have values close to zero, which is unlikely. Going back to the data, I confirmed that the temperature and salinity fields for those months are missing. To clean up the time series, I first assigned &lt;code&gt;None&lt;/code&gt; to those months (every value below 100), which pandas treats as missing, and then interpolated linearly from the neighboring months. All three time series in the same &lt;em&gt;DataFrame&lt;/em&gt; are processed by the following two lines.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;OHC_multilevels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;OHC_multilevels&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;OHC_multilevels&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interpolate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;figure&gt;&lt;img src="https://yu-cheng.co/img/output_33_1.png" width="600"&gt;&lt;figcaption&gt;
&lt;h4&gt;Missing data filled with linear interpolation&lt;/h4&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
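&lt;p&gt;The same mask-then-interpolate idea can be checked on a tiny table where the answer is easy to verify by hand. This is only a sketch: the column name and values are invented, and it uses &lt;code&gt;DataFrame.where&lt;/code&gt; to blank out the bad values instead of assigning &lt;code&gt;None&lt;/code&gt; in place.&lt;/p&gt;

```python
import pandas as pd

# Invented values: three of the six months are suspiciously close to zero.
toy = pd.DataFrame({'OHC300': [500.0, 0.5, 520.0, 0.0, 0.0, 550.0]})

# Keep values of at least 100; everything else becomes NaN (missing).
cleaned = toy.where(toy >= 100)

# interpolate() is linear by default and treats the rows as evenly spaced.
filled = cleaned.interpolate()

print(filled['OHC300'].tolist())  # [500.0, 510.0, 520.0, 530.0, 540.0, 550.0]
```

&lt;p&gt;Because the default method assumes evenly spaced rows, it suits monthly data like mine; for an irregular time axis, &lt;code&gt;interpolate(method='time')&lt;/code&gt; may be the better choice.&lt;/p&gt;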
&lt;p&gt;This is just a glimpse of the awesomeness of pandas. More details can be found in the pandas documentation.&lt;/p&gt;</description></item></channel></rss>