soitgoes511

Blog

Thoughts on engineering, life in France, and technology.

An untapped data source: Grafana Loki and Promtail

#Loki, #Promtail, #Grafana, #data, #logs, #log

## Motivation

For years, I have been using numerous different databases on an almost daily basis: Oracle, PostgreSQL, MariaDB, InfluxDB and so on. I enjoy the challenge of writing complex queries and discovering patterns or trends in the results. I believe that most individuals in a business setting should be familiar with SQL (though this is far from reality). I am also well aware that databases are not the only place data can be extracted from. Some other data sources include:

- Web scraping
- Flat files (Excel, CSV, JSON)
- Logs

I am familiar with all the sources named in this list with the exception of logs. Have I looked at or seen logs? Of course I have. I still refer to my **dmesg** log if I have a file system or device mounting issue. I have used numerous apache/httpd logs to troubleshoot **WSGI** applications. But neither of these cases ever merited learning to use something like a log aggregator. Until now.

For the first time in my professional career, I have found the need to extract log data which is not available anywhere else except in a mountain of logs; more specifically, logs generated by automation systems. Since I have read about various solutions for analyzing log data in the past, I have decided to use this opportunity to test drive one of them. I will not be using the automation logs mentioned above for this post; instead, I will be using various system logs to better understand the tooling and log aggregation capabilities in general.

## Grafana Loki

Loki is a log aggregation system, inspired by Prometheus, which was started in 2018 by Grafana Labs. I will be using the local install (non-Docker, on bare metal), version 2.7.1. I am not going to run through the install process. In short, the Loki archive came with a single binary included. A configuration file is required to indicate the desired location for stored data, the listening port number, and so on. The installation process is relatively straightforward, and the only issues I encountered were due to improper permissions. I have included the configuration file I am using for my test drive below.

### Loki configuration file used

```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

limits_config:
  max_query_length: 0h

query_scheduler:
  max_outstanding_requests_per_tenant: 4096

frontend:
  max_outstanding_per_tenant: 4096

ruler:
  alertmanager_url: http://localhost:9093

analytics:
  reporting_enabled: false
```

I have also created a service file to start the Loki daemon quickly, without the need to remember how to source the configuration file. I am including it as a reference for myself and for any reader who stumbles upon this post and uses systemd.

```conf
[Unit]
Description=Loki service
After=network.target

[Service]
Type=simple
User=loki
ExecStart=/usr/local/bin/loki-linux-amd64 -config.file /usr/local/bin/loki-local-config.yaml

[Install]
WantedBy=multi-user.target
```

So, now that I have Loki downloaded, configured and running, I need to start sending some log streams to the Loki API. To accomplish this task I will be using Promtail. The next section will briefly cover Promtail and also include its configuration and service files.

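Before moving on, it is worth confirming that the Loki server is actually up and listening. A minimal sanity check for systemd users, assuming the unit above was saved as `/etc/systemd/system/loki.service` and the port 3100 from my configuration:

```bash
# register and start the Loki unit defined above
sudo systemctl daemon-reload
sudo systemctl enable --now loki

# Loki exposes a readiness endpoint on its HTTP port;
# it should return "ready" once the ingester has started
curl http://localhost:3100/ready

# the set of known labels will begin to fill in once Promtail starts pushing logs
curl http://localhost:3100/loki/api/v1/labels
```
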
## Promtail

Promtail should be installed on any system containing logs to be collected. Promtail runs as a background service, monitors the log files and extracts any newly appended log entries from them. Once extracted, the log entries are labeled and pushed to the Loki server, which is actively listening (on port 3100 according to the configuration file above).

### Promtail configuration file used

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
  - job_name: apache
    static_configs:
      - targets:
          - localhost
        labels:
          job: apache
          __path__: /var/log/apache2/access.log
```

...and the service file used to start the Promtail daemon on my systemd system:

```conf
[Unit]
Description=Promtail service
After=network.target

[Service]
Type=simple
User=promtail
ExecStart=/usr/local/bin/promtail-linux-amd64 -config.file /usr/local/bin/promtail-local-config.yaml

[Install]
WantedBy=multi-user.target
```

I would like to mention now that the Promtail configuration file can be used to create labels (think of a label as something you might want to group by during aggregation) before the log stream is sent to the Loki listener. Reading some of the documentation initially, I was under the impression that this was how Loki **should** be used. As such, the simple configuration file shown above was loaded with pipelines containing some serious regular expressions to pre-process the logs. After much frustration, and after reading a multitude of Grafana community posts, I came to the realization that **creating labels is best handled at query time**. Too many labels lead to issues with series cardinality. It is also painful to test regex by continuously stopping and restarting the Promtail daemon (I am not a pro in all the flavors of regex used today; Loki and Promtail understand _Go RE2 regex strings_). I will discuss these points more later in the post.

## Connecting to Grafana

If you are familiar with Grafana, then you already know that the next step is to create a new data source, so I will do just that. For testing and for the purpose of this post, I am running Loki, Promtail and Grafana all on the same computer. With this in mind, here is my new data source:

<img src="/assets/loki_data_source.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

Next I will:

1. Create a new dashboard
2. Select the newly created Loki source as my data source
3. Choose job under label filters
4. Select apache, which is the job name I gave the apache access.log in the Promtail config
5. Generate a table to inspect how Loki is representing the log data

### Grafana table panel output: apache2 access.log

<img src="/assets/loki_apache_table.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

### Grafana log panel output: apache2 access.log

<img src="/assets/loki_grafana_log_panel.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

I chose the apache access log for testing because it is relatively straightforward. Each log entry is recorded on a single line rather than across multiple lines, and the expected data fields are repeated on every line of the log.

Let's take a look at a single log line and examine its fields:

    127.0.0.1 - - [01/Jan/2023:21:33:40 +0100] "GET /grafana_local/api/search?dashboardUIDs=alP6m1c4k&limit=30 HTTP/1.1" 200 525 "http://localhost/grafana_local/?orgId=1" "Mozilla/5.0 (X11; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0"

- 127.0.0.1: The client making a request to the server
- \- : Identity of the client making the request (often just a hyphen)
- \- : User ID of the person requesting the resource
- \[01/Jan/2023:21:33:40 +0100\]: Date and time of the request
- "GET /grafana_local/api/search?dashboardUIDs=alP6m1c4k&limit=30 HTTP/1.1": Request type and resource being requested
- 200: HTTP response status code
- 525: Size of the object returned to the client
- "http://localhost/grafana_local/?orgId=1": The HTTP referer, which represents the address from which the request for the resource originated
- "Mozilla/5.0 (X11; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0": The User Agent, which identifies the browser the client is using to access the resource

Now, let us discuss the various fields returned by Loki with no transformations performed as of yet (the output of the Grafana table panel screenshot shown above). There are five columns:

- **Column 1**: Labels appended by Promtail in key-value pairs (currently filename & job)
- **Column 2**: The timestamp of when the log line was scraped by Promtail (**I could use the timestamp from the log line itself by using the timestamp stage in the Promtail config pipeline; I chose not to for simplicity**. I will include an example of this at the end of the post)
- **Column 3**: The log line in its entirety, as seen in the apache access log
- **Column 4**: tsNs, the datetime stamp converted to epoch time with nanosecond precision
- **Column 5**: A unique id prepended with the epoch datetime stamp

Viewing the log entries in Grafana is great and all, but now the real fun begins. I would like to answer some basic questions about the data in the logs using Grafana Loki's query language, LogQL. Yes, yes... another, different query language, I know. Apparently it is reminiscent of PromQL, so if you know PromQL, then you have a leg up on me.

## LogQL: Exploring my access.log data

I am not going to spend a great deal of time discussing the nuances of LogQL. I am not an expert, and there are plenty of documents available that will do the language more justice than I could. I simply want to discuss some of the basic concepts and point out some items I struggled with when initially writing a LogQL query.

To begin, the one and only required argument for a valid query is a stream selector. Using the access log as an example (remember the job label **apache** was created in the Promtail config):

    {job="apache"}

The above query is valid and will return exactly the five columns shown in the screenshot above, with the same labels. I could retrieve the same data using the following query to select the access log by filename:

    {filename="/var/log/apache2/access.log"}

With this basic log stream selected, I can begin filtering and transforming the log data to extract some insights. For example, if I only cared about log lines containing the **GET** method, I could filter the log lines like this:

    {job="apache"} |= `GET`

The resulting data set would only contain the log lines containing a **GET** request. Effectively, this can be thought of as grepping a log file to only output the desired lines.

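As an aside, the same grep-style filtering can also be run from a terminal with logcli, the command-line client Grafana ships alongside Loki. A minimal sketch, assuming the logcli binary is installed and pointed at the local instance from my configuration:

```bash
# tell logcli where the local Loki instance is listening
export LOKI_ADDR=http://localhost:3100

# last hour of apache access log lines containing GET
logcli query --since=1h --limit=20 '{job="apache"} |= "GET"'
```
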
Inversely, all log lines not containing the word **GET** can be returned too:

    {job="apache"} != `GET`

Moving on to creating labels. I mentioned that, from my limited experience, creating labels at query time is the best path forward versus over-complicating the Promtail config file(s). There are multiple ways to do this, and your choice will most likely depend on the shape and format of your log files. Some of these ways include:

- Regex
- Pattern parsing
- JSON
- Logfmt
- Line format expressions

Due to the repeated structure of my access.log, I will be using the pattern parser to create labels. I personally find this easier than using regex. Here is an example of creating a label out of every field containing data in my log:

    {job="apache"} | pattern `<ip> - - [<dttm>] "<method> <resource> <protocol>" <status> <object_size> "<referer>" "<user_agent>"`

The labels generated by the above pattern parse statement are __ip__, __dttm__, __method__, etc. I can now take these labels and begin to perform aggregations such as counting, summing and so on. If you are **not** interested in making every single field into a label (and if you are not actively using the labels, why would you?), you can use an underscore to leave out the **don't cares**. Below, only the __ip__ field will be labeled:

    {job="apache"} | pattern `<ip> - - [<_>] "<_> <_> <_>" <_> <_> "<_>" "<_>"`

Grafana includes a cool feature which explains how your log line transformation pipeline (query) will work:

<img src="/assets/loki_explain.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

...and with this simple query and minimal work, I can count IP address occurrences from my apache access log (which is not very interesting in this case, but on a high-traffic server it might be).

<img src="/assets/loki_countby_ip.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

...or I can see what types of resources the visitors to the server are using...

<img src="/assets/loki_countby_referer.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

## Closing comments

The primary complications I have experienced to date with Grafana Loki have been:

- Difficulty querying long-term data (greater than one month, the current default)
  - The `max_query_length: 0h` setting found in the Loki config overrides this behavior
- Struggling with over-complicating the Promtail configuration
  - I have opted to create the primary data transformation pipeline at query time rather than in the config
  - Here is what **ChatGPT** says about the topic: <img src="/assets/chatgpt-loki.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>
- The interval calculation is confusing when Grafana calculates the step interval for the Bar Chart panel
  - Resolution can also be toggled to varying ratios; for me, the behavior is not consistent and is unpredictable
  - Aggregating by 5m does not necessarily generate a bar at 5m intervals
  - I defaulted to using the bar style of the __Time Series__ panel and have not been surprised by the results
  - My aversion to the Bar Chart panel is most likely due to my own errors and ignorance

In summary, I am excited to bring Loki into my current observability stack at work as a new data source. I also know that I still have a lot to learn to take full advantage of Grafana Loki's capabilities. I promised to give a working example of extracting the log timestamp to use for analysis (rather than the time Promtail scraped the log line) as a bonus.
Please find a working config below that does just that:

### Promtail config: timestamp extraction

Below is a working configuration file to extract and use the timestamp within the apache access log. Even if the timestamp being extracted does not exactly match a standard layout (RFC3339, RFC822Z, etc.), you can work around this by defining a custom format using Go's reference time: the 2nd of January 2006 at 15:04:05, offset -0700, as shown in the sample configuration below.

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
  - job_name: apache
    pipeline_stages:
      - match:
          selector: '{job="apache"}'
          action: keep
          stages:
            - regex:
                expression: "\\[(?P<timestamp>[\\w:/]+\\s[+\\-]\\d{4})\\]"
            - timestamp:
                source: timestamp
                format: "02/Jan/2006:15:04:05 -0700"
                location: "Europe/Paris"
    static_configs:
      - targets:
          - localhost
        labels:
          job: apache
          __path__: /var/log/apache2/access.log
```

...and that is all for now. If you made it this far, thank you for suffering through the post. Please let me know if you find a mistake in anything I have written.

> There is no subject so old that something new cannot be said about it.
>
> — Fyodor Dostoevsky


Has the world lost its momentum?

#remote #work, #teletravail

## A quick message to the reader

This post is the product of working the past two long years, at two large companies, in two different countries and in two different sectors. If you expect to learn something profound reading this, I can almost assure you that you will not. My motivation is only to put my frustrations into writing and to state my case against remote work. I am of the opinion that years from now, when scholars are studying the collapse of civilization as it is today, they will most certainly attribute our fall, at least in part, to social media and remote work. Today I will focus on the latter of the two topics, as it is a subject near and dear to my heart.

## The context

I thought COVID-19 would be over within a few weeks. God, was I wrong. Despite the fact that I am not wearing a mask this week (I did have to wear one in a meeting only last week), there has been a fundamental shift in the way we live life and conduct business in this semi-post-COVID world. One of those fundamental shifts has been the normalization of remote work (known as télétravail in my new home country). I know, it sounds like a dream come true. On _Hacker News_ I frequently read individuals touting their great productivity while working remotely. They were never as productive before as they are now, sitting or lounging wherever they are, away from their colleagues and their stuffy offices while checking their morning emails. Perhaps this is entirely true; I really do not know. I can only speak to how it has felt working onsite for the past few years while many of my colleagues opted to work remotely. Maybe my experiences on the receiving end of remote work are unique, but I doubt this is the case.

## In the beginning

<img src="/assets/dylan_birth2.jpg" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

When the stay-at-home orders were initially issued, my employer was deemed an essential entity and therefore continued to operate 24 hours a day, 7 days a week. My son was about to be born, and I was fortunate to be provided with paternity leave as a father. Of course I took every bit of the time off and then some. Most of my time off was to help my wife with the kids (especially my son, who seemed to never sleep), but I also did not want to go to work. Little was known about COVID at the time, and I would check daily, with impending doom, how many more little red dots had appeared near me on the Johns Hopkins COVID dashboard. In the end, my paid time off was running low and I decided that it was high time to return to work. I was an engineer, but my position also placed me in direct supervision of approximately 15 people who were required to be onsite. Due to my position, HR gave me the option to work remotely. While the offer was tempting, I do not believe I would have, nor should have, been respected by my direct reports had I accepted. If they had to be onsite, I should be onsite. And so I went...

<img src="/assets/jh_dashboard.jpg" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

## Insult to injury

I had two offices on different floors. Prior to the pandemic, I would frequently occupy whichever office would allow me to focus on my work with the fewest interruptions. I promise that I was not shirking my responsibilities, but aside from my day-to-day duties I also worked on many projects which required some quiet and my undivided attention. Now, with the pandemic in full swing, I did not have to play hide-and-seek to focus on my work. Everything was quiet...
in fact, too quiet. Since I had been out for over a month (I _had_ been in frequent contact with my team by phone and via email), I spent my first day or two checking in on my team. How were they doing? How were their families? How had work been? This final question elicited the most responses and complaints. It seemed that practically no one from management had been onsite since the stay-at-home order was issued (about one month). The only form of communication was an email sent every few days from HR and upper management to the building. These emails would announce the latest butcher's bill of COVID-positive individuals in the building. Accompanying these numbers was always a message to remind everyone that their safety was the company's number one priority. This messaging seemed to be the root of most of the ill feeling among the people onsite (myself included). Given the backdrop of everything happening, the messaging did feel disingenuous. To this day I shudder thinking about those emails. If safety were really the number one priority, everyone would be home, right? But I digress. This post is not about distasteful emails but about remote work, and apparently most, if not all, of management, including engineering, was doing just that: working remotely.

## Radio silence

Initially, the quiet was a welcome reprieve from the typical chaos which permeated our factory. I had a number of ideas from my time away from work that I wanted to bring to life, and the silence allowed me to do just that. I quickly noticed, though, that when I did need to send an email to a colleague, responses were not as punctual as they seemed to have been before the stay-at-home order. Maybe it was just me? At the time, I asked myself this question (I am not the most patient individual in the world), but today I am convinced otherwise. The more I needed an answer from planning, from product engineering, or from other engineers, the more I came to the realization that people could not really be on their computers, logged in and doing real *work*, all day. They just couldn't. Many of my requests were critical to customers receiving their semiconductors on time.

As an example, our planning department had the capability of flagging wafer lots as a priority. A wafer lot with a priority flag of zero was the highest priority, then priority one and finally priority two. Typically, priority wafer lots were closely monitored, and if one was placed on hold for more than 30 minutes, phones would begin to ring asking why the lot was not being processed. We had rules on the books about this very topic. I was at the very end of our factory flow in the probe and test area, so I spoke frequently with planning and would field their phone calls almost every morning. These phone calls stopped coming in almost completely. One morning I returned after a long weekend to find a priority-zero lot had been on hold for longer than 72 hours and no one seemed to care. No emails, no phone calls, nothing... The 30-minute rule which had been sacrosanct was being violated and nothing... Radio silence. I could share a laundry list of examples similar to this, but I will not. I assure you that there were many, many more. Needless to say, if I could feel the lack of support from all of leadership in my company, so could my direct reports.

After a few months of this radio silence, my wife and I decided to take the children and relocate to France. My wife is French, so this was not a random choice.
Other factors in our decision: my wife has a large family in France (our daughter is severely handicapped, so the family support is a great help); the area of France we were moving to (and now live in) is much safer for my children to grow up in than Dallas, Texas, where we were living; and finally, I realized that my work was just that, work. I was paid very well, I was given the tools to create great things and I was left alone for the most part to be autonomous, but it really didn't feel like anyone cared anymore. So, after some introspection, I decided that money isn't everything and why not take the plunge and move? I will find another job where people are engaged and passionate about what they do. So we sold the house, shipped our belongings across the Atlantic and moved.

## No more momentum

Fast forward a year and a half. I have found a job doing technical things in a manufacturing environment which I enjoy (always different problems and so much to be improved). I have a great team who has looked past my rudimentary but steadily improving French (I admit my improvement is slower than I would like). By all accounts, everything is better than I could have hoped. So why am I writing this now? I am writing this because the lack of engagement, enthusiasm and passion which I felt before leaving my last place of employment seems to have followed me across the Atlantic and contaminated Europe. Emails and messages frequently draw slow responses. My manager requested access for me to a specific system from someone whose primary duty is to grant that access, and approximately one month came and went before I could access the system. It is so ridiculous that all I can do is laugh. I feel like the world was hobbling slowly forward before the pandemic, and the worldwide closures completely stopped what little momentum we had. I deal with many suppliers in other countries, and I experience this lack of engagement everywhere I look, with almost every company I deal with. These are not one-off observations; this seems to be how the world works now.

## First step towards a solution

After much thought on the subject, I have become convinced that the first tangible step employers can take to restore that momentum is forcing their employees back onsite. I know this is an unpopular belief, but can we all honestly say that our businesses are more productive right now than they were before the pandemic? Perhaps some institutions or companies are, but by and large I would argue that they are not. Even as a consumer, it has been a struggle dealing with everyone from our old gas company to the architect we hired to make a simple drawing (six months later and the architect still has not finished it). Accomplishing the simplest task requiring input from someone can take weeks. Is the individual on vacation? Does the person have COVID? Maybe they work in another group now? I really don't know. If we could only begin communicating with each other again, we could go back to solving good problems. And for me, walking to a colleague's office to quickly get a question answered seems so much more efficient than sending an email and waiting. I have grown tired of waiting.

> "No man is an island entire of itself; every man
> is a piece of the continent, a part of the main;
> if a clod be washed away by the sea, Europe
> is the less, as well as if a promontory were, as
> well as any manner of thy friends or of thine
> own were; any man's death diminishes me,
> because I am involved in mankind.
> And therefore never send to know for whom
> the bell tolls; it tolls for thee."
>
> — John Donne, "No Man is an Island"


Apache Superset: Test driving a new tool with familiar metrics

#superset #postgresql

For years, I have been using a piece of software by the name of _Spotfire_, Tibco Spotfire to be more specific. The software falls into the category of what is referred to these days as **BI** or _Business Intelligence_ software :chart_with_upwards_trend:. While I was never a huge proponent of Spotfire because there were limitations on what it could do, I was still pretty decent at embedding R & SQL, creating templates and ultimately deploying the dxp files to the web (via WebPlayer). Spotfire is proprietary and licensing can cost big money, so I used it at work and work alone. :books:

I trained many engineers at my previous employer in SQL, R, Python, Linux, etc. But the request I received most often was to train them on getting the most out of Spotfire. In fact, during my final few months I was giving Spotfire trainings to various groups multiple times a week. Given that these business intelligence tools are becoming commonplace at most companies, and that they give people who do not necessarily nerd out on application and web development the ability to create their own reports or data-driven applications with relative ease... why not spend some time exploring a relatively new open source BI tool by the name of Apache Superset?

## Superset: Install and test drive

### Installation

According to the Apache Superset documentation found [here](https://superset.apache.org/), there are two ways to install Superset (at least for Linux):

1. Using docker-compose
2. Using pip

I began by testing the software using the **docker-compose** install. The install was very straightforward and I was able to log in using admin/admin as my username and password. After logging in, I was able to connect to a local instance of PostgreSQL without any issue. Before getting too carried away, I decided to log out, stop and remove the containers, and then fire them immediately back up. I am happy I did, because I was greeted with an error and could not, for the life of me, restart the containers. So, on to using pip in a virtual environment (option 2 above). I used the following steps to install Superset with pip:

1. Installed OS dependencies: `sudo apt-get install build-essential libssl-dev libffi-dev python3-dev python3-pip libsasl2-dev libldap2-dev`
2. Created an empty directory for my virtual environment and superset: `mkdir superset`
3. Changed directories into that empty directory: `cd superset`
4. Created the python virtual environment: `python3 -m venv venv`
5. Activated the virtual environment: `. venv/bin/activate`
6. Installed Apache Superset: `pip install apache-superset`
7. Created username and set password: `superset fab create-admin`
8. Created roles and permissions: `superset init`
9. Installed psycopg2 since I am connecting to a PostgreSQL database: `pip install psycopg2`
10. Finally, started the flask development web server: `superset run -p 8088 --with-threads --reload --debugger`

After performing the above steps, I was able to open my web browser on my localhost at port 8088 and log in to Apache Superset. Connecting to my PostgreSQL database server was very easy. I selected _Data_ on the top nav bar and _Databases_ from the dropdown.

Once on the database page, I selected the _+ DATABASE_ button in the upper right-hand side of the page, opening this menu :point_down::

<img src="/assets/superset_add_db.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

The display name can be anything you want; I would suggest something that describes the type of data stored in the database you are connecting to. My connection string for my PostgreSQL database looks like this:

    postgresql+psycopg2://soitgoes511:XXXXXXXXXX@192.168.0.12:5432/pulse_oximeter_historic

If you believe you will ever upload CSV files for analysis, this is where you configure that. Go to _ADVANCED_ -> _Other_ -> toggle _Allow data upload_. If you have selected this option and upload a CSV file, that CSV file will be loaded as a table into the database you just created the connection for.

After completing the above steps, I quickly faced an issue. Despite being able to connect to my database successfully, when attempting a simple query I was greeted with the error below:

<figure>
<img src="/assets/psycopg2_error.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>
<figcaption>Error appearing on each query attempt in Superset if the dataset contained a timestamp with a tz offset.</figcaption>
</figure>

There is still an open [issue](https://github.com/apache/superset/issues/15768) on Github concerning the error above. Long story short, the error appears when a dataset contains a timestamped column, and more specifically, a timestamped column containing an offset. To head off any headaches, if you see this error, check your **psycopg2** version. I, as well as at least one other individual, had version 2.9.1 installed, and downgrading to **psycopg2==2.8.6** made the error go away. :heavy_check_mark:

Now that Apache Superset is installed, my database is connected and the error is gone, I can move forward with creating datasets, charts and dashboards.

#### Creating a dataset

The entire spirit of business intelligence tools is to explore data, look for insights and hopefully solve some good problems. Once you have the story you would like to tell laid out on a dashboard, you can share it with your colleagues. On the job, I have made it a habit of trying to condense all of my biggest care-abouts down to two or three views or dashboards. If there were certain data points co-workers or managers asked for often, I made sure they were on one of my dashboards. This allowed me to identify issues in near real time and be reactive. It also freed up my time to dig into new, interesting problems. Some people like the mindless task of spending hours manually compiling their data and reports; I prefer to automate it.

So, to create these dashboards, we need data. Superset will allow you to import CSV files or query the database we have already connected to. Since I enjoy SQL, and have some data I would like to review stored in my PostgreSQL db, I will walk through querying a PostgreSQL database in Superset via **SQL Lab** and then saving that dataset and query. The first step is to select _SQL Lab_ on the top nav bar and, from the dropdown, select _SQL Editor_. You will see this, but without the query and return values:

<img src="/assets/superset_sql_lab.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

Please note that you select the database connection you would like to use in the upper left-hand corner (pulse_ox in my case).

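A quick aside before the query editor: if the error above ever reappears, the driver version and the raw connection can both be checked from the same virtual environment, outside of Superset. A minimal sketch (the credentials are placeholders):

```bash
# confirm which psycopg2 version the virtual environment is using
pip show psycopg2

# pin the older driver if needed (this is what made the error go away for me)
pip install psycopg2==2.8.6

# verify that the database itself is reachable with the same connection details
psql "postgresql://user:password@192.168.0.12:5432/pulse_oximeter_historic" -c "SELECT 1;"
```

With that confirmed, back to SQL Lab.
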
Once I entered the query seen below, I can hit **ctrl+enter** to sanity-check and review the data. Here is a closer look at the query shown in the screenshot:

```sql
SELECT (time at TIME ZONE 'CEST') AS "time",
       ma_spo2,
       ma_bpm,
       ma_perf
FROM pulse_ox_moving_average
WHERE time >= (NOW() - INTERVAL '1 hour')
ORDER BY time DESC;
```

Now, before I save this query and the dataset, I will remove the **WHERE** and **ORDER BY** clauses. I do not want to filter the data in the dataset or the underlying query, but with a filter on the dashboard. I also do not want to order the dataset, since all that will do is waste compute time and is unnecessary. Also note that I am converting my time to _Central European Summer Time_. I have the timestamps stored as _UTC_ in my postgres instance and would like to review the traces in my local timezone.

To save the query for later use or fine-tuning, select the blue **SAVE** button under the text box. To save the dataset and begin charting, select the blue **EXPLORE** button directly above the returned results. Once named and saved, a new **EXPLORE** tab will open in the browser and I can begin building my charts.

#### Creating charts and filters

Creating charts is fairly straightforward. There are a multitude of chart types to choose from:

<img src="/assets/superset_chart_types.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

I have not experimented with them all, but I have with most. Since my data is time-series data, I will choose the time-series chart type (the last chart in the above assortment). This chart type does not have the ability to use both a left and a right y-axis, but I can create multiple traces on the same plot which share the same y-axis. Here is my first attempt at a chart:

<img src="/assets/superset_explore_chart.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

When creating a dashboard in Superset, you choose from your already created and available charts and/or filters. So, I will go ahead and create a simple time range filter and some more charts before I move forward with my dashboard. Here is a simple filter (found as a chart type):

<img src="/assets/superset_filter.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

I am even able to build forecasting into a chart using **Prophet**. The component plots are not available, but designating my confidence interval and the daily, weekly and yearly seasonality is. To use this feature, you must install pystan and prophet:

1. `pip --no-cache-dir install pystan==2.19.1.1`
2. `pip install prophet`

As of this post, pystan >= version 3.0 will not work, hence me specifying version 2.19.1.1 above.

<figure>
<img src="/assets/superset_bpm_prophet.jpg" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>
<figcaption>Prophet forecast of my daughter's heart rate.</figcaption>
</figure>

#### Putting it all together into a dashboard

Once all of the charts and filters have been created, it is time to construct a dashboard. Don't fret if something needs tweaking further down the road; all you have to do is update the chart, and the dashboard (assuming it contains the chart) will reflect the updates.

<figure>
<img src="/assets/superset_pox_dash_pic1.jpg" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>
<figcaption>Moving average pulse oximetry metrics materialized with dbt and queried from postgres.</figcaption>
</figure>

<figure>
<img src="/assets/prophet-models-health-superset.jpg" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>
<figcaption>3-day heart rate and temperature forecast with 90% confidence interval. Forecast generated using Prophet and configured easily in Apache Superset.</figcaption>
</figure>

#### Closing comments

I mentioned Spotfire at the beginning of this post, so, after walking through creating datasets, charts and dashboards with Apache Superset, I should mention a few glaring differences between the two BI tools.

- Spotfire is proprietary; Superset is open source.
- I commonly used R to wrangle and transform data in Spotfire as a **Data Function**; Superset only allows SQL (easier, but limiting).
- Spotfire could run as a free-standing desktop application on my PC or be deployed to the web and accessed via WebPlayer. Superset runs as a standalone server on some system. I have seen individuals deploying Superset to Heroku.
- There are more chart types available in Superset, but I have had difficulty using some of them (**Multiple Line Charts** specifically). Superset is still relatively new and I have faith that some of the bugs will be worked out soon.
- Errors are easier to interpret (IMHO) in Superset than in Spotfire. I don't know how many times I had some hidden whitespace character in my Spotfire Data Function that was preventing my function from executing. The errors would yield zero valuable insight.

There are many facets of Apache Superset I did not delve into (users, roles, annotations and layers, deploying on a production server, etc.). The primary motivation for this post was to show that good business intelligence tools are available to use for free, and can help you gain insight into whatever data you are attempting to put under a microscope. Why not take advantage of them if the need arises?

> "There was nowhere to go but everywhere, so just keep on rolling under the stars."
>
> ― Jack Kerouac, On the Road: the Original Scroll


A simple use case for dbt and Airflow

#rpi #postgres #dbt #airflow

My previous two posts involved installing a postgres server on a cheap, spare raspberry pi 3b+. The motivation was to keep my time-series data for longer than 30 days, since my free InfluxDB Cloud account only has a 30-day retention policy. I successfully installed a fresh OS, configured the **rpi** to run headless, scanned the local IPs to find the rpi, ssh'd into the single board computer, installed a **PostgreSQL** instance and finally wrote a script to query my InfluxDB instance, transform the data and push it to my postgresql database. I used a cronjob to execute the script, with a comment about accomplishing the same task using **Apache Airflow**. So today, I am back to offload the scheduling from the cron daemon to Airflow, and I am also throwing a **dbt** incremental materialization into the mix.

### Motivation for added complexity

You might say that the cronjob was working just fine, and if it isn't broken, why fix it :wrench: ? I even commented during that writing that using Airflow felt like bird hunting with a scud missile. Despite my comment, I have a simple and good reason to make use of Airflow. The pulse oximeter data I am collecting is very noisy, more so when my daughter moves, when the sensor is not attached well, or when the sensor itself is defective. Whatever the case may be, outliers are not uncommon. Here is a screenshot to illustrate my point. The orange trace below represents the mean over a five-second rolling window of my daughter's heart rate. So, these data have already been "smoothed" slightly, considering I capture her **heart rate**, **oxygen saturation (SpO2)** and **perfusion index** all at a frequency of one sample per second.

<img src="/assets/bpm_five_sec_agg.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

I would like to smooth the signal even more using something simple like a moving average. This would allow me to more easily spot trends. Since I look at my daughter's trends regularly, I need the data to be relatively up to date. I also take snapshots from my cloud database on an hourly basis, so it would be nice to calculate a moving average for my metrics of interest at the same time and frequency the updates occur. In this spirit, I decided to use dbt (**D**ata **B**uild **T**ool) to perform the transformation and materialization, and Airflow to ensure that this dbt model runs after, and only after, the data is loaded into postgresql from my cloud instance. So, let's get started.

## dbt: Install, init, model creation and test

I will not delve into all of the fine details of dbt; the [documentation](https://docs.getdbt.com/docs/introduction) found online is excellent and, if you are interested in databases, I encourage you to check it out yourself. I will say that dbt allows me to easily perform a transformation on data in my postgresql database using a simple query. After the transformation has been performed, dbt handles the materialization for me.

### dbt: Install and initialization

dbt is a Python package and can therefore be installed with pip. Using _pip_, I installed dbt-postgres on my rpi3b+: `python3 -m pip install dbt-postgres`. I specified PostgreSQL because I am transforming data on a postgres server and dbt needs the postgres adapter to interact with the database. Some other supported databases include BigQuery, Redshift and Snowflake.

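A quick way to confirm the adapter actually landed in the environment before initializing a project (just a sanity check; the exact output varies by version):

```bash
# confirm the package and adapter are present before running dbt init
python3 -m pip show dbt-postgres
dbt --version
```
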
After the install was complete, I initialized a project called **health_metrics**:

    $ dbt init health_metrics --adapter postgres
    Running with dbt=0.20.0
    Creating dbt configuration folder at /home/pi/.dbt
    With sample profiles.yml for postgres

    Your new dbt project "health_metrics" was created! If this is your first time
    using dbt, you'll need to set up your profiles.yml file (we've created a sample
    file for you to connect to postgres) -- this file will tell dbt how to connect
    to your database. You can find this file by running:

      xdg-open /home/pi/.dbt

    For more information on how to configure the profiles.yml file,
    please consult the dbt documentation here:

      https://docs.getdbt.com/docs/configure-your-profile

    One more thing:

    Need help? Don't hesitate to reach out to us via GitHub issues or on Slack --
    There's a link to our Slack group in the GitHub Readme.

    Happy modeling!

Next, I need to update the `profiles.yml` file, as indicated by the message shown above after initializing my dbt project. Below is my updated `profiles.yml` file with the database password excluded:

```yaml
default:
  outputs:
    dev:
      type: postgres
      threads: 1
      host: 192.168.0.12
      port: 5432
      user: soitgoes511
      pass: <PASSW_OF_DB>
      dbname: pulse_oximeter_historic
      schema: public
    prod:
      type: postgres
      threads: 1
      host: 192.168.0.12
      port: 5432
      user: soitgoes511
      pass: <PASSW_OF_DB>
      dbname: pulse_oximeter_historic
      schema: public
  target: dev
```

Once my `profiles.yml` configuration is complete, I can save the file and move on to writing the actual model.

### dbt: Model creation and test

A model, in its most basic form, is a **.sql** file containing a single SQL **SELECT** statement. Included in the same directory as the **.sql** file is a file named `schema.yml` containing the model name, the column names being returned by the **SELECT** statement, a brief description of each column, and test assertions which can be used to validate that your model is performing the way you expect it to. Further down the road, your model can be tested simply by executing `dbt test`.

So, to the task at hand. I want to remove some noise and smooth my data, and will start by calculating a simple moving average. My original data has been aggregated using the mean over a five-second window before it is loaded into my postgresql database. This means I should have 12 samples per minute. I have been looking at pulse oximeter data for years now and feel confident that a five-minute window will remove most of the noise without losing too much information. I can always adjust later, or revert back to my 5-second aggregated data. No harm, no foul. So, 12 samples per minute for 5 minutes equates to 60 data points (I need this for my query below).

One more point to cover before getting to the actual query/model. dbt supports templating, macros, references, etc. The possible materializations include table, view, incremental and ephemeral. **Table** and **view** are hopefully self-explanatory; if not, please refer to the documentation link I referenced earlier in this writing. I will be using **incremental**. On the first execution, an incremental materialization builds the complete table. For later model runs, dbt only builds the new data onto the table, assuming I have the **is_incremental()** macro wrapping the filters used to identify the new data. Again, the dbt folks do a much better job explaining this, and that explanation can be found [here](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models).

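Before writing the model itself, it is worth letting dbt check the plumbing. `dbt debug`, run from inside the project directory, validates the project files and the connection details configured in `profiles.yml` above (shown here as a quick sketch):

```bash
cd health_metrics
# validates dbt_project.yml, profiles.yml and attempts a connection to postgres
dbt debug
```
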
Finally, here is the first iteration of my moving average model, which will build a table incrementally (completely at first, and it can be rebuilt anew if the `--full-refresh` flag is used at runtime):

```sql
{% raw %}
{{ config(materialized='incremental') }}
{% endraw %}

SELECT
    time,
    spo2,
    AVG(spo2) OVER(
        ORDER BY time ROWS BETWEEN 59 PRECEDING AND CURRENT ROW
    ) AS ma_spo2,
    bpm,
    AVG(bpm) OVER(
        ORDER BY time ROWS BETWEEN 59 PRECEDING AND CURRENT ROW
    ) AS ma_bpm,
    perf_index,
    AVG(perf_index) OVER(
        ORDER BY time ROWS BETWEEN 59 PRECEDING AND CURRENT ROW
    ) AS ma_perf
FROM pox_five_second_mean

{% raw %}
{% if is_incremental() %}
WHERE time > (SELECT max(time) FROM {{ this }})
{% endif %}
{% endraw %}
```

The contents of my `schema.yml`:

```yaml
version: 2

models:
  - name: pulse_ox_moving_average
    description: "5 minute moving average for spo2, bpm and perf"
    columns:
      - name: time
        description: "The primary key for this table"
        tests:
          - not_null
      - name: spo2
        description: "SPO2 sensor read aggregated by 5 minutes from time series"
        tests:
          - not_null
      - name: ma_spo2
        description: "5 Minute moving average of SPO2, previous 59 rows plus current"
        tests:
          - not_null
      - name: bpm
        description: "BPM sensor read aggregated by 5 minutes from time series"
        tests:
          - not_null
      - name: ma_bpm
        description: "5 Minute moving average of BPM, previous 59 rows plus current"
        tests:
          - not_null
      - name: perf_index
        description: "Perfusion Index sensor read aggregated by 5 minutes from time series"
        tests:
          - not_null
      - name: ma_perf
        description: "5 Minute moving average of Perfusion Index, previous 59 rows plus current"
        tests:
          - not_null
```

And finally, let me run the model and see what happens:

    $ dbt run
    Running with dbt=0.20.0
    Found 1 model, 7 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources, 0 exposures

    22:45:55 | Concurrency: 1 threads (target='dev')
    22:45:55 |
    22:45:55 | 1 of 1 START incremental model public.pulse_ox_moving_average........ [RUN]
    22:45:56 | 1 of 1 OK created incremental model public.pulse_ox_moving_average... [INSERT 0 0 in 0.84s]
    22:45:56 |
    22:45:56 | Finished running 1 incremental model in 1.41s.

    Completed successfully

    Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Notice the relatively fast execution time. That is because I have run this model before and there is no new data loaded to compute the moving average on. The actual initial runtime is about 22 seconds; please bear in mind this is an oldish rpi3b+ and not a supercomputer. Also note that if I had specified a **table** materialization, the entire table would have been rebuilt, and that would have taken even longer than 22 seconds. If I were using a pay-per-use cloud provider, that would most likely equate to money down the drain :money_with_wings: .

Here is a sample of the result of my successful model materialization:

<img src="/assets/ma_post_test.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

Now that the model is working and has been tested, it is time to automate the model run at the top of each hour. Let's move on to setting up Apache Airflow and creating a DAG (**D**irected **A**cyclic **G**raph).

## Apache Airflow: Install, setup and DAG creation

Airflow was created by Maxime Beauchemin and used at Airbnb in 2014 to programmatically schedule tasks. Since that time, Airflow has become a part of the Apache Software Foundation.

I decided to change from my simplistic cron scheduler to Airflow to ensure that the moving average is not calculated until after the pulse oximeter data has been retrieved from my InfluxDB instance, shaped and loaded into the PostgreSQL database server running on my raspberry pi. I could have handled this with a cronjob in quite a few ways (e.g., serializing the data extraction and loading with the moving average calculation and table creation in the same script, or simply by skewing the timing of two different crons, which is an error-prone hack IMHO). While this is a simplistic use case, Airflow still solves my script execution timing issue in an elegant manner (with some added complexity). :grin:

### Airflow: Install and setup

I used the following steps to install Airflow on my raspberry pi. Please be aware that these steps were taken directly from the Apache Airflow docs, which can be found [here](https://airflow.apache.org/docs/apache-airflow/stable/start/local.html). I have not altered nor improved these steps in any way and am in no way taking any credit for them:

```bash
# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow

AIRFLOW_VERSION=2.1.2
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
# For example: 3.6
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
# For example: https://raw.githubusercontent.com/apache/airflow/constraints-2.1.2/constraints-3.6.txt
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

# initialize the database
airflow db init

airflow users create \
    --username admin \
    --firstname Peter \
    --lastname Parker \
    --role Admin \
    --email spiderman@superhero.org

# start the web server, default port is 8080
airflow webserver --port 8080

# start the scheduler
# open a new terminal or else run webserver with ``-D`` option to run it as a daemon
airflow scheduler

# visit localhost:8080 in the browser and use the admin account you just
# created to login. Enable the example_bash_operator dag in the home page
```

After the installation was complete, I made all the necessary changes to `airflow.cfg`, located at `~/airflow/` (which in my case was `/home/pi/airflow/`). The primary setting I needed to know was where my DAGs were located; in my case this setting is `dags_folder = /home/pi/airflow/dags`, and this is where my DAG (**.py**) files will go. I also did not adjust the **default_timezone**. I used **UTC** time, as I did for my PostgreSQL instance and as I would recommend anyone reading this do. Timezones have caused me some real headaches over the years, and I find the best practice to be storing all data using UTC timestamps and then handling the timezones on the application end if necessary.

### Airflow: DAG creation

A DAG is a Directed Acyclic Graph, as I have already mentioned. Put simply, it is a graph whose nodes are tasks. Those tasks are connected from one node to the next, and the overall graph contains no closed loops. The screenshot of my DAG below, taken from the Airflow webserver, should clarify the _graph_ concept. My tasks are the nodes, and the overall graph carries the name of my DAG. Notice there is directionality illustrated by an arrow pointing from **load_pg_from_influx** :arrow_right: **moving_average_calc**.
Therefore, the moving average task will be executed only after the task loading data into postgres is complete:

<img src="/assets/airflow_graph.jpg" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

Below is my DAG, located in `/home/pi/airflow/dags/`. I used two operators for the two different tasks:

1. DbtRunOperator (from the **airflow-dbt** package on PyPI), which makes it easy to execute my dbt model
2. BashOperator, which allows me to execute commands in a Bash shell

I have set the `schedule_interval` to run at the top of each hour, every hour of every day. This should be familiar to those who have scheduled a cronjob before... I can use the same syntax.

```python
from airflow import DAG
from airflow_dbt.operators.dbt_operator import DbtRunOperator
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
from datetime import timedelta

#
# The default dir contains my dbt models, the retries has been added
# to handle if the cloud provider is offline for maintenance (this happened).
# Finally, I allow 30 minutes to attempt a re-execution of the DAG
#
default_args = {
    'dir': '/home/pi/dbt_world/health_metrics',
    'start_date': days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=30)
}

with DAG(dag_id='pulse_ox_data',
         default_args=default_args,
         schedule_interval='0 * * * *') as dag:

    influx_pg = BashOperator(
        task_id='load_pg_from_influx',
        bash_command='/usr/bin/python3 /home/pi/pg_loader/health_metrics_loader.py',
        dag=dag
    )

    dbt_run = DbtRunOperator(
        task_id='moving_average_calc',
    )

    influx_pg >> dbt_run
```

Please note that the task variable influx_pg pipes into the dbt_run variable in the above script/DAG. This defines my graph's directionality and order of execution. If I decide at some later point in time that I would like to materialize another view or table from the original influx_pg data, I can pipe into a list object (e.g. **t1 >> [t2, t3]**).

Like I said, what I have done so far is not very complex. Despite that, I was able to solve my scheduling dilemma in an elegant manner. If this were some critical data pipeline, I could send emails to myself or my team if my tasks fail. Airflow gives much more information and ability than a vanilla cronjob would, allowing for easier troubleshooting of inefficiencies, bottlenecks or failures.

Last but not least, here is a screenshot of the resulting moving average (blue) for my daughter's heart rate, overlaid onto the original heart rate data stored in my PostgreSQL database (orange):

<img src="/assets/ma_bpm.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

If someone reads this and spots an obvious mistake or has some additional insights or questions, please send me an email.

> "I don't know what is behind the curtain; only that I need to find out."
>
> ― Richard Paul Evans, Lost December


Extract, transform & load: InfluxDB Cloud to Local PostgreSQL (Part 2)

#rpi #influxdb #postgresql #etl #cronjob

### Quick Recap

In part 1, I walked through the following actions in an effort to keep my cloud data for a longer duration:

1. Burning a new OS image to a microSD
2. Setting up a raspberry pi 3b+ to run headless
3. Installing and configuring a postgresql server on the rpi
4. Testing the postgresql instance by writing data to it remotely

This probably sounds backwards if you did not read [part 1](https://soitgoes511.github.io/rpi/headless/postgresql/2021/07/10/rpi_headless_postgres.html), but to recap: I don't pay for my InfluxDB cloud account and have a 30-day retention policy on my data storage. 30 days is fine for real-time monitoring, but for longer-term modeling I have installed a PostgreSQL database on an oldish raspberry pi 3b+ and aim to collect longitudinal health metrics. So, without further ado...

### Extract, Transform and Load

First, I would like to offer an explanation as to why I have decided not to store the data in a local instance of InfluxDB. Then I will give some insight into the data I have stored in the cloud, its structure and its resolution. Finally, I will share the script I will be using to perform the ETL operation.

#### Why postgres rather than a local influxdb instance?

Using InfluxDB OSS seemed like the most obvious choice, and this was my initial intent. What stopped me short of doing this is the lack of an InfluxDB version >= 2.0 available for my rpi3B+ architecture (armv7l). If I had a spare rpi4 lying around, this would have been my choice. I do have InfluxDB version 1.8.6-1 available in my repo:

    influxdb/unknown,now 1.8.6-1 armhf  Distributed time-series database.

This version will allow me to query with both the legacy InfluxQL query language and Flux, if I toggle that option in the configuration. But I cannot write to the time-series database using Flux. I do not like the idea of being locked into the older version of InfluxDB, and therefore I chose postgres as my storage solution.

#### A glimpse of the raw data

The screenshot below was taken from InfluxData's GUI **Data Explorer**. A few items to note. I did not aggregate the data; I take samples from my daughter's pulse oximeter once a second, and this is the same resolution I store in the cloud. The start time is the starting bound of the query, now() - 10s in this case, and the stop time was now(). The time column is the actual time of the sensor read. The value column is the sensor reading itself and is paired with the field column; a field, in time-series lingo, is an un-indexed column. The measurement column is analogous to the table name in a relational database, which in this case is spo2 (not the best choice, and I should have named it differently). Finally, UID is a tag, or an indexed column. It is residue from me experimenting with multiple sensors and completely lacks any information or utility at this point in time.

<img src="/assets/streaming_tables.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

The sensor reads are all stored in InfluxDB in UTC time. I plan on doing the same for my postgresql instance. From my experience, timestamps can cause some real headaches, and the clearest path is to store in UTC and handle any timezones on the application end while modeling.

#### ETL Script

So, a brief outline of what the script should do:

1. Create a time range/constraint of one hour, since I will be taking snapshots every hour of every day
2. Query InfluxDB using those time constraints
#### ETL Script

So, a brief outline of what the script should do:

1. Create a time range/constraint of one hour, since I will be taking snapshots every hour of every day
2. Query InfluxDB using those created time constraints
3. Handle as much of the transformation as I can during the query, e.g. aggregate, pivot and drop columns
4. Save the query return to a pandas dataframe
5. Drop the final unwanted columns, convert the timestamp into datetime64 and make dttm an index
6. Push the dataframe to the postgresql instance, appending onto the table if it exists

This script will only execute once per run, therefore, since I am taking one-hour snapshots, I need to schedule it to execute once per hour. More on that shortly.

```python
# Filename: health_metrics_loader.py
# Author: Mike Hinkle
# Purpose: Extract data from InfluxDB cloud account, sanitize
#          and push to local postgres instance

import os
import pandas as pd
from influxdb_client import InfluxDBClient
from sqlalchemy import create_engine
from dotenv import load_dotenv
from datetime import datetime, timedelta

load_dotenv()
#
# Load secrets from dotenv file
#
TOKEN = os.getenv("TOKEN")
ORG = os.getenv("ORG")
URL = os.getenv("CONNECTION")
PG_CONNECT = os.getenv("PG_CONNECT")
#
# Create time range to bound query (top of last hour -> top of this hour, UTC)
#
now = datetime.utcnow().strftime("%Y-%m-%dT%H:00:00Z")
last = datetime.utcnow() - timedelta(hours=1)
last_hour = last.strftime("%Y-%m-%dT%H:00:00Z")
#
# Instantiate connection object for TSDB
#
client = InfluxDBClient(url=URL, token=TOKEN, org=ORG)
query_api = client.query_api()
#
# Query InfluxDB cloud instance and return in df:
# 5 s mean aggregation, drop bookkeeping columns, pivot fields to columns
#
df_current_sats = query_api.query_data_frame(
    'from(bucket: "pulse_oximeter") '
    "|> range(start: " + last_hour + ", stop: " + now + ") "
    '|> filter(fn: (r) => r._measurement == "spo2") '
    "|> aggregateWindow(every: 5s, fn: mean, createEmpty: false)"
    '|> drop(columns: ["_start","_stop","_measurement","uid"])'
    '|> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value")'
)
#
# Close connection object
#
client.close()
#
# Try and except to handle case when cloud provider is offline or
# rpi unplugged (away from wifi).
#
# Drop don't care columns, rename time header and set time as index
#
try:
    df_current_sats.drop(labels=["table", "result"], axis=1, inplace=True)
    df_current_sats.rename(columns={"_time": "time"}, inplace=True)
    df_current_sats["time"] = pd.to_datetime(df_current_sats["time"])
    df_current_sats["time"] = df_current_sats["time"].astype("datetime64[us]")
    df_current_sats.set_index("time", inplace=True)
    #
    # Instantiate sqlalchemy connection object
    #
    engine = create_engine(PG_CONNECT)
    #
    # Use pandas to write dataframe to postgres instance
    #
    df_current_sats.to_sql("pox_five_second_mean", engine, if_exists="append")
    #
    # Close sqlalchemy connection object
    #
    engine.dispose()
except KeyError:
    pass
```

### Scheduling considerations

Now that the script is complete, and I know that it works, I need to ensure that it is run once per hour. The path of least resistance, since I am running a single script on the same raspberry pi hosting the postgresql server, will be to run the script as a cronjob. I realize that there are more robust scheduling solutions such as **Apache Airflow**, and I intend to redo the scheduling with a DAG (Directed Acyclic Graph) at a later date (a rough idea of what that could look like is sketched below). For now, I want to get the data collected and I do not have airflow installed on the rpi3b+. So, forgive my rush and the simplicity of the solution.
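Since I brought up Airflow, here is a rough sketch of how this hourly job might look as a DAG, assuming Airflow 2.x were installed on the pi and the loader were refactored into an importable `run_etl()` function; the dag_id, task_id and module layout are all hypothetical:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical: health_metrics_loader.py refactored so its logic lives in run_etl()
from health_metrics_loader import run_etl

with DAG(
    dag_id="pulse_oximeter_hourly_etl",
    start_date=datetime(2021, 7, 1),
    schedule_interval="0 * * * *",  # same cadence as the cronjob below
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(
        task_id="influx_to_postgres",
        python_callable=run_etl,
    )
```

The automatic retries alone would be a nice upgrade over cron, but for now the cronjob below is what actually runs.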
After pushing the script to the raspberry pi 3b+, installing the dependencies (python-dotenv, psycopg2, sqlalchemy, influxdb-client and pandas) and creating my .env file containing my secrets, I can create the cronjob like so:

```bash
pi@influxdb-historic:~ $ crontab -e
```

```bash
0 * * * * /usr/bin/python3 /home/pi/pg_loader/health_metrics_loader.py >> ~/cron.log 2>&1
```

That should do the trick. I will check later tonight, plug my new postgres instance into my Grafana install running on my workstation and see if the data is collecting successfully.

### 8-hours later (Update)

It looks like the script and the cronjob were a success:

<img src="/assets/grafana_historic.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

I will be revisiting this project to discuss various models once I have ample historical data. I might also redo the scheduling using Apache Airflow, though I am hesitant since that seems like bird hunting with a scud missile and slightly overkill. If you see any opportunities for improvements or mistakes, please shoot me an email. Thank you for reading and for your time.

> “Questions give us no rest.”
> ― Ayn Rand


RPI3b+ (Headless) running PostgreSQL instance on localhost for longer term data retention (Part 1)

#rpi #headless #postgresql

## Motivation

I have been collecting my daughter's pulse oximeter data for almost two years. She is 24-hour ventilator dependent and my wife and I have had some close calls where she has stopped breathing. Initially, the only way we could see her sats was to be physically in front of the pulse oximeter. Obviously, this is not a realistic option. Over the last few years, I have attempted different solutions to not only monitor her, but also to model her breathing and various health metrics. Some of those solutions were:

1. php scripts through apache on a local server
2. InfluxDB, also on a local server and plugged into Grafana for visualization
3. InfluxDB Cloud, which enabled us to monitor her remotely

I still use number 3 above to this very day:

<img src="/assets/mobile_web_grafana.jpg" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

Full disclosure, I am cheap and use the hobbyist free account for both InfluxDB and Grafana, which understandably comes with some limitations. One of those limitations is a 30-day data retention policy for InfluxDB. 30 days is great for real time monitoring, but what if I want to model seasonal patterns year-on-year? This is the spirit of this project: I want to capture as much historical data as possible without spending any money. I already have a spare raspberry pi 3b+ and I have an internet connection, what more do I need?

Covered in this write-up today will be:

1. Downloading Raspberry Pi OS and flashing the OS to a microSD
2. Activating SSH and the WIFI connection to run headless
3. Remotely logging into the pi to bring the system up-to-date
4. Downloading and configuring postgreSQL
5. Testing that configuration

Part 2 of this series will actually extract the data I have in my InfluxDB Cloud instance using the Flux query language, shape the data and push the desired data to the postgreSQL instance I will be creating today. So, let us begin...

### Downloading raspberry pi OS and flashing the OS to a microSD card

I have been using raspberry pi's for years and I cannot emphasize enough how great they are for the price. After ordering a board, which typically costs me ~35$ US, I need to download an OS and flash that OS to a microSD card. Since I am looking to do this quickly, I am not installing anything cute like Arch or Gentoo. Raspberry Pi OS it is. The images can be downloaded here: [OS Download](https://www.raspberrypi.org/software/operating-systems/). You will find multiple versions; in my case, since I am configuring this single board computer to run headless, I do not need a GUI or any additional software (like LibreOffice, etc..), so I selected Raspberry Pi OS Lite.

Once the OS is downloaded, I change into the directory where the download was saved (in my case ~/Downloads/) and I unzip the image using the following command: `$ unzip 2021-05-07-raspios-buster-armhf-lite.zip`, which will uncompress a single image. In my case, the image was named **2021-05-07-raspios-buster-armhf-lite.img**. For the past few years, I have used balenaEtcher to flash images to microSDs: [Balena Download](https://www.balena.io/etcher/). I have never had any issues and the software works wonderfully. Today however, I wanted to try out a USB flasher named **Popsicle** which comes pre-installed on Pop!_OS:

<img src="/assets/popsicle_screenshot.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

Once the image was selected and pointed at my microSD card, I could proceed to burn the image.
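As an aside, the same burn can be done straight from the terminal if no GUI flasher is handy. A minimal sketch with `dd`, assuming the card shows up as /dev/sdX (the device name is a placeholder, double check it with `lsblk` first since writing to the wrong device is destructive):

```bash
# Confirm which block device is the microSD card (placeholder below: /dev/sdX)
lsblk

# Write the image, show progress and flush to the card before removing it
sudo dd if=2021-05-07-raspios-buster-armhf-lite.img of=/dev/sdX bs=4M status=progress conv=fsync
```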
If I was hooking the raspberry pi up to a monitor and keyboard after the image was burned, I could go ahead and install the microSD card into my raspberry pi and skip the next section below. But, since I do not have an extra monitor and want to run headless, I will be adding a few extra files to the microSD card before I remove it from my workstation.

### Activating SSH and WIFI connection to run headless

To avoid needing a monitor and a keyboard for my rpi, and to make the single board computer accessible via SSH, I need to add 2 files to the root of /boot on the newly installed image. On my workstation, once the microSD is mounted to my filesystem, I can change directory into /boot (the microSD card is mounted for me at /media/run/) and `touch ssh`; this creates an empty file named **ssh** which enables ssh. The second file I need to create is named **wpa_supplicant.conf** and it should be located in the same directory where I just created the empty ssh file. The contents of this file are:

```conf
country=FR
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1

network={
    ssid="NAME-OF-YOUR-WIFI-NETWORK"
    psk="PASSWORD-FOR-YOUR-WIFI"
}
```

I can save and close the above file and I am done. The microSD card can be unmounted, ejected and installed into my raspberry pi. Once the raspberry pi is plugged into a 5V power supply, and I am within range of my wifi, I should be able to ssh in from my workstation.

### Remotely logging into pi to bring the system up-to-date

Once my pi is plugged in and out of the way, I can use **nmap** from my workstation to find out which ip address my rpi was assigned. But first, I need my inet ip address:

```bash
$ ifconfig
wlp5s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.0.28  netmask 255.255.255.0  broadcast 192.168.0.255
        inet6 2a01:e0a:897:1680:4e48:5fa5:da96:558c  prefixlen 64  scopeid 0x0<global>
        inet6 2a01:e0a:897:1680:19f8:90c1:6834:f9e6  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::3e7d:52b9:f37d:b024  prefixlen 64  scopeid 0x20<link>
        ether 74:d8:3e:01:6d:14  txqueuelen 1000  (Ethernet)
        RX packets 66847  bytes 50766070 (50.7 MB)
        RX errors 0  dropped 36  overruns 0  frame 0
        TX packets 46905  bytes 9180586 (9.1 MB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```

The relevant ip address above is **192.168.0.28**. Armed with this, I can now use nmap to determine my rpi's address:

```bash
$ sudo nmap -sn 192.168.0.28/24
Starting Nmap 7.80 ( https://nmap.org ) at 2021-07-10 14:58 CEST
Nmap scan report for 192.168.0.12
Host is up (0.25s latency).
MAC Address: B8:27:EB:E0:08:FB (Raspberry Pi Foundation)
```

So, I can access my rpi remotely via ssh using the ip address **192.168.0.12**. Please note that the default password for Raspberry Pi OS is _raspberry_, so enter that when prompted:

```bash
$ ssh pi@192.168.0.12
pi@192.168.0.12's password:
Linux influxdb-historic 5.10.17-v7+ #1421 SMP Thu May 27 13:59:01 BST 2021 armv7l

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Sat Jul 10 12:27:06 2021 from 192.168.0.28
```

Please note that I have logged in already prior to this, so your output may be slightly different. The first thing I did was to change the password from raspberry to a password of my choice. To do this, type `passwd` and then press enter. You could also type `passwd pi` and then enter.
Follow the prompts to update the password. After this is complete, I like to change the keyboard and language settings using `sudo raspi-config`. Once all the settings are to my liking, I save and restart the pi: `sudo reboot`. That will kick me off ssh and terminate my connection. After a minute or so, I can re-connect via `ssh pi@192.168.0.12`, enter my new password and, once logged in, continue to update my packages from the repo with `sudo apt update && sudo apt upgrade`. That's it for this section; the next step will be downloading and configuring the postgresql server.

### Downloading and configuring the PostgreSQL Database Server

If you are unfamiliar with databases, PostgreSQL is considered an RDBMS or **R**elational **D**atabase **M**anagement **S**ystem and is, in a nutshell, an excellent place to store relational data. In a corporate setting I was more familiar with Oracle, but PostgreSQL is essentially the same thing for zero cost (which I would argue makes it better than Oracle :stuck_out_tongue_winking_eye:). Anyway, moving on... Let's install it:

```bash
$ apt search postgresql
postgresql/stable,now 11+200+deb10u4 all
  object-relational SQL database (supported version)
```

The above command is used to search through the default system repos. There will be many items returned, but the package I would like to install is the supported version seen above.

```bash
$ sudo apt install postgresql
```

After this is complete, you can check to see if the database server is running:

```bash
$ systemctl status postgresql
● postgresql.service - PostgreSQL RDBMS
   Loaded: loaded (/lib/systemd/system/postgresql.service; enabled; vendor preset: enabled)
   Active: active (exited) since Sat 2021-07-10 12:50:01 CEST; 2h 57min ago
  Process: 7771 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
 Main PID: 7771 (code=exited, status=0/SUCCESS)

Jul 10 12:50:01 influxdb-historic systemd[1]: Starting PostgreSQL RDBMS...
Jul 10 12:50:01 influxdb-historic systemd[1]: Started PostgreSQL RDBMS.
```

If I didn't see this above, I would need to start the service myself. Also, if I would like the postgres database to start at boot, then I would need to enable it (assuming systemd):

```bash
$ sudo systemctl start postgresql
$ sudo systemctl enable postgresql
```

So, now postgreSQL is installed and running in the background. I can log into the database, but I need to switch to the postgres user first:

```bash
$ sudo su - postgres
```

Now to connect through the CLI using the postgresql-client:

```bash
postgres@influxdb-historic:~$ psql
psql (11.12 (Raspbian 11.12-0+deb10u1))
Type "help" for help.

postgres=#
```

And it was a success. I typically create a new user at this point with a password. This user I am creating will be the owner of my historical data database.
```
postgres=# CREATE USER soitgoes511 WITH PASSWORD '<YOUR_PASSWORD_HERE>';
```

Finally, I want to create my database and change the ownership to the new user I created:

```
postgres=# CREATE DATABASE pulse_oximeter_historic;
postgres=# ALTER DATABASE pulse_oximeter_historic OWNER TO soitgoes511;
```

I can see my new database owned by yours truly:

```
postgres=# \l
                                     List of databases
          Name           |    Owner    | Encoding |   Collate   |    Ctype    |   Access privileges
-------------------------+-------------+----------+-------------+-------------+-----------------------
 postgres                | postgres    | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 pulse_oximeter_historic | soitgoes511 | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0               | postgres    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
                         |             |          |             |             | postgres=CTc/postgres
 template1               | postgres    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
                         |             |          |             |             | postgres=CTc/postgres
(4 rows)
```

### Test database by writing to it remotely

Before testing, I need to make a few changes to some of the postgreSQL configuration files. The loader I will be writing to populate this database with historical data will most likely run locally (on the pi), but I will be testing from my workstation. Therefore, I need to give authorization for my ip address to connect to the db, and I need the database to listen on more than just localhost. I will also eventually be accessing the database to query data for modeling and will need to ensure that I can access it:

```bash
$ sudo vim /etc/postgresql/11/main/postgresql.conf
```

First, I uncomment and change this line under _connections and authentication_:

```conf
listen_addresses = '*'          # what IP address(es) to listen on;
```

Save and close. Then I can open up **pg_hba.conf** and give permission to my workstation to connect:

```bash
$ sudo vim /etc/postgresql/11/main/pg_hba.conf
```

```conf
host    all             all             192.168.0.28/32         trust
```

Once this line is appended, I can restart my postgresql server and attempt to write some data to it:

```bash
$ sudo systemctl restart postgresql
```

From my workstation, I am using a **jupyter-notebook** and python 3 to first test the remote connection and then to load a dummy dataset I downloaded as a csv file off **Kaggle**. Here are those scripts and their output:

```python
import psycopg2

conn = psycopg2.connect(
    host="192.168.0.12",
    database="pulse_oximeter_historic",
    user="soitgoes511",
    password="<YOUR_PSQL_PASSWORD>",
)

# create a cursor
cur = conn.cursor()

# execute a statement
print("PostgreSQL database version:")
cur.execute("SELECT version()")

# display the PostgreSQL database server version
db_version = cur.fetchone()
print(db_version)

# close the communication with the PostgreSQL
cur.close()
```

The connection was successful.
```
OUTPUT:

PostgreSQL database version:
('PostgreSQL 11.12 (Raspbian 11.12-0+deb10u1) on arm-unknown-linux-gnueabihf, compiled by gcc (Raspbian 8.3.0-6+rpi1) 8.3.0, 32-bit',)
```

Let me attempt to write some data to the db:

```python
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(
    "postgresql://soitgoes511:<YOUR_PSQL_PASSWORD>@192.168.0.12:5432/pulse_oximeter_historic"
)

df_best_sellers = pd.read_csv("/home/soitgoes/Kaggle/bestsellers_with_categories.csv")
df_best_sellers.to_sql("bestsellers", engine, if_exists="replace", index=False)

engine.dispose()
```

Finally, to sanity check that the data made it there, let me query it:

```python
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(
    "postgresql://soitgoes511:<YOUR_PSQL_PASSWORD>@192.168.0.12:5432/pulse_oximeter_historic"
)

query = """SELECT * FROM bestsellers LIMIT 5;"""
df = pd.read_sql(query, engine)

engine.dispose()
df.head()
```

<img src="/assets/dummy_table.png" alt="drawing" style="max-width: 100%; height: auto; text-align: center;"/>

It worked :sparkles:. That is a wrap for today. Part II, as I mentioned, will delve into actually extracting the relevant data, transforming/shaping it and then loading it into my new postgres instance hosted on my very cheap and wonderful rpi3b+. Thank you for reading.

> “I think, at a child's birth, if a mother could ask a fairy godmother to endow it with the most useful gift, that gift would be curiosity.”
> ― Eleanor Roosevelt


Broken virtual environments: Why it happened and how I should have prevented it

#linux #update #python #venv #broken #prevention

Approximately 3-4 days ago, I updated my Linux box (**Pop!_OS** version 20.10 -> 21.04). I have run Linux long enough to know that there could be unintended issues during these major version updates. My first defense against possible headaches is to wait a few days and hope others find the bugs, report them, and everything gets ironed out quickly. Years ago I was not as cautious :see_no_evil:, but I no longer have the luxury of time to bang my head into a desk for days and slog through issues or re-compile kernels. I just play it safe now.

The good news is that despite a drastic change to the Pop!_OS desktop environment (from **GNOME 3.38.4** to **COSMIC**), the update and upgrade went very smoothly. The new DE is an improvement on GNOME, my system is very responsive, the stars all aligned and everything seemed perfect. Kudos to the system76 team :thumbsup:. But... not everything was perfect...

### What exactly was the problem?

Let me begin by saying that my issue has nothing to do with my updated OS. I decided yesterday that I wanted to update a Heroku application which lives in a local repository on my now-updated workstation. This application was written in Python and the interpreter executed within a virtual environment (venv). Virtual environments should be self contained, correct? Sounds safe, right :skull:?

What happened was that the OS upgrade replaced my pre-existing system python (3.8) with a newer python version (3.9). Had I sourced a standalone python install when creating the virtual environment, or had the OS just switched PYTHON_TARGETS, then I would not be writing this right now. I had created the venv like so for this particular Heroku application:

`soitgoes@pop-os:~$ python3 -m venv venv`

The above command created said virtual environment inside a directory called venv, with a symbolic link pointing from the venv interpreter to my system python interpreter @ /usr/bin/python3.8, which no longer existed :disappointed::

`lrwxrwxrwx 1 soitgoes soitgoes 6 Jun 17 21:38 python3.8 -> python`

At my previous work, I had many python versions installed on a dev mount to avoid this very issue.

### How to fix the issue and my steps moving forward ...

Install a standalone development python which lives separately and isolated from my system python. This will avoid the pain I am enduring now when my system is updated again with a new python install. I have no excuse for dropping the ball like this but it is what it is. Lesson learned. Steps to remedy the situation, starting with that isolated python install:

1. Download the desired python version: `$ wget https://www.python.org/ftp/python/3.8.11/Python-3.8.11.tgz`
2. Uncompress the python version: `$ tar -xzvf Python-3.8.11.tgz`
3. Change into the directory: `$ cd Python-3.8.11/`
4. Configure with the target location: `$ ./configure --prefix=/home/soitgoes/python-3.8.11`
5. Compile and build: `$ make`
6. Test the build: `$ make test`
7. Install python to the target location: `$ make altinstall`

I can see that the above steps were successful by executing the newly installed interpreter, which lives in the target location specified above (/home/soitgoes/python-3.8.11):

`soitgoes@pop-os:~$ ~/python-3.8.11/bin/python3.8`

And voilà... The REPL (**R**ead, **E**valuate, **P**rint, **L**oop) prompt appears, reflecting my desired version:

```
Python 3.8.11 (default, Jul  6 2021, 22:44:39)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
```
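Going forward, any new virtual environment for this kind of project can be created directly against that isolated interpreter, so the next OS upgrade cannot pull the rug out again. A minimal sketch, assuming the install location above and a hypothetical project directory:

```bash
# Create the venv against the standalone 3.8.11 build instead of /usr/bin/python3
cd ~/projects/my_heroku_app        # hypothetical project path
~/python-3.8.11/bin/python3.8 -m venv venv

# The venv's interpreter now resolves to the isolated install, not the system python
source venv/bin/activate
python --version
```

As for the venv that already broke, relinking it is the quicker fix.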
My system python has been unchanged, and I can verify this very easily:

```bash
soitgoes@pop-os:~$ which python3
/usr/bin/python3
soitgoes@pop-os:~$ python3
Python 3.9.5 (default, May 11 2021, 08:20:37)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
```

Please notice the versioning differences. The next step in this process will be to rebuild the symbolic links in the affected virtual environment. Please make note that the python3 symlink points to the system /usr/bin/python3 (which is the issue, since it is now python 3.9 rather than 3.8). The other symlinks just chain all the python aliases together: python -> python3, which, as I just mentioned, points to the system python install. Finally, the last link is python3.9 -> python3 again:

```bash
lrwxrwxrwx 1 soitgoes soitgoes  7 Jul  4 22:58 python -> python3
lrwxrwxrwx 1 soitgoes soitgoes 16 Jul  4 22:58 python3 -> /usr/bin/python3
lrwxrwxrwx 1 soitgoes soitgoes  7 Jul  4 22:58 python3.9 -> python3
```

These soft links need to be re-directed to my newly built and installed python, which is now located in my home directory. I can do this simply by changing into the affected venv/bin/ directory where the symlinks are present and then:

`$ ln -sf /home/soitgoes/python-3.8.11/bin/python3.8 python3`

The symlinks are now corrected. When activating the previously affected venv and running the python interpreter, I am greeted with version 3.8.11 rather than 3.9.5. My application will again run without the need of rebuilding all of my dependencies. Since this is fixed and it is getting late, I will need to actually do what I set out to do initially (update my Heroku app) later. Enough problems solved for one day.

**The moral of the story is to ensure you separate your system dependencies from your development dependencies. No sense in muddying the waters and causing unnecessary headaches.** :heavy_check_mark:

> The mind is not a vessel to be filled, but a fire to be kindled.
> _― Plutarch_


Initial observations while settling in

#france #blog #careers #EE

I have begun applying for jobs. Perhaps I am being unrealistic considering I am a beginner in the French language; I could not imagine finding a job in the States while not speaking English. :crossed_fingers:

### Some observations ...

My first observation is that most positions in the fields I am looking at ask for Master's degrees or PhDs. I guess when university doesn't cost $100k for an undergraduate degree :moneybag:, there is more motivation to go a step further. I find myself regretting not sticking it out for my Master's, but that ship has sailed. At least I have experience in industry.

My second observation is that Electrical Engineering degrees here do not translate to the same studies in the US. I could be wrong, and I will have a better perspective on this very shortly. I have noticed quite a bit of variation in EE degrees in the US, too. I have friends who never studied transistors at their schools, while we were drowned in small signal models, large signal models, BJTs, MOSFETs, etc.. After long discussions with my wife, my understanding is that the French university system is, for the most part, standardized.

Due to the second observation above :point_up:, there will probably be some ambiguity for potential employers. I am attempting to find a career in big data, data science, data engineering, analytics, web development, etc.. I am not sure if that aligns even remotely with what their EEs learn at university, but it is not a huge leap from my past curriculum and studies. Only time will tell. It has only been 2 weeks since I set foot on French soil and I need to be patient and keep plugging away. On the bright side, in the meantime, I can spend time with my family. I never would have had this much time with my children had we stayed in the US.

> Per aspera ad astra


A new start in France

#first #france #blog #careers #family

I will keep this short and sweet. I left corporate life in **America** to move to my wife's home country, **France**. I had worked in the _semiconductor industry_ for about 4 years as an _engineer_ in various roles. My background at University was **Electrical Engineering** with a concentration in microelectronics. My employer treated me well and finances were good. I had a real career.

About 3.5 years ago my beautiful daughter was born. My wife and I were informed at our 20-week sonogram that there were issues. Our life has never been the same since. I have been through plenty of hard times in my life, but to this day, seeing her struggles has really darkened my soul. We came to France so she and my son could be with family. We came to France so they could be loved. We are only here on this earth for a brief moment and, like _Bob Marley_ said so wonderfully:

>"...life is worth much more than gold."

This is my first blog entry. I promise that I will delve into some fun projects down the road. I am waiting until we purchase our house to nerd out on some **IoT and pub-sub messaging systems**. If you read this, thank you. :grin:

<div style="display: inline-block;">
  <img src="/assets/baby_girl_1.jpeg" alt="drawing" style="height: 300px;"/>
</div>
<div style="display: inline-block;">
  <img src="/assets/french_town_1.jpg" alt="drawing" style="height: 300px;"/>
</div>
<div style="display: inline-block;">
  <img src="/assets/baby_boy_1.jpeg" alt="drawing" style="height: 300px;"/>
</div>
