December 6, 2014 at 8:20 pm #657 uRADMonitor (Keymaster)
Due to the large volume of data (68 bytes/station/minute), a way of optimising storage and data access becomes mandatory.
December 6, 2014 at 11:44 pm #665 Lars (Participant)
How is the database laid out today?
December 7, 2014 at 11:05 am #678 uRADMonitor (Keymaster)
Currently there is a single database for both development and production. There are two tables: one is devices and the other is uradmonitor.
1) devices
Has a single row per uRADMonitor detector; the primary key is the device ID. It holds only the latest readings, the 24h average for radiation (as CPM), the detected location, the overridden location, the country code and the offline/online status (a unit is considered offline if no data has been received for more than 10 minutes).
As we speak this has 116 rows and uses a little over 12 KB.
2) uradmonitor
This now has 7.4 million rows and a size of 432 MB.
Each unit in the network sends data via an HTTP GET call to data.uradmonitor.com. The parameters include the unit ID, the measurements and a CRC. All of this data goes into the database.
So each minute, one station sends approximately 58 bytes of data to the server.
The problem is that we quickly went from one unit to ten, and we are now approaching 100 units.
Probably we'll need to archive some of the old data. It would be nice to find a way to do it automatically, so we get faster DB queries on the fresh data while still allowing slower queries, for scientific purposes, over ALL the data.
December 7, 2014 at 1:35 pm #683 Ally (Participant)
Radu, on table 2 is there a timestamp for when new rows are inserted into the table?
I've been looking into some good open source ETL tools which would allow us to easily, and automatically, archive rows from table 2 into an archive table. This table wouldn't be queried under normal circumstances, I guess, but would be available if there was a need to see the whole history of a unit.
For all the tables, we need to make sure they're properly indexed and partitioned to optimise query speed. I'm not sure what options MySQL provides for that; I will need to look into it.
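From a quick look, a rough sketch of what indexing and time-based partitioning could look like in MySQL; the column names below (device_id and a unix-time column called time) are assumptions, not the actual schema:

-- Composite index so "one device over a time range" queries avoid full table scans.
ALTER TABLE uradmonitor
  ADD INDEX idx_device_time (device_id, time);

-- Range partitioning by time, so old partitions can be dropped or archived cheaply.
-- Note: MySQL requires the partitioning column to be part of the primary key,
-- so the PK may first need to become (id, time).
ALTER TABLE uradmonitor
  PARTITION BY RANGE (time) (
    PARTITION p2014 VALUES LESS THAN (UNIX_TIMESTAMP('2015-01-01')),
    PARTITION p2015 VALUES LESS THAN (UNIX_TIMESTAMP('2016-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
  );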
December 7, 2014 at 2:13 pm #685 vinz (Member)
Hello Radhoo, hello Ally,
The table with raw values you have will always be needed to store the raw data (maybe it can be compressed).
From this table you will create views or other (processed) tables. But first of all, we need to know which data we need and what we want to display.
As a basis for discussion, I include the structure of my (quickly made) table.
My device reports to this. For example, at first I had the "timestamp" column for every entry. Then I realised that selecting and grouping is much faster when I use an integer "timestamp_unix" instead.
I use the grouping in views, for example 10-minute or 1-hour averages over a given timespan (a sketch of such a view follows the table definition below).
mysql> SHOW CREATE TABLE tab_values\G
Create Table: CREATE TABLE tab_values (
  id int(11) NOT NULL AUTO_INCREMENT,
  created timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  timestamp timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  timestamp_unix int(11) DEFAULT NULL,
  uradid varchar(16) DEFAULT NULL,
  p varchar(16) DEFAULT NULL,
  ts int(11) DEFAULT NULL,
  inv int(11) DEFAULT NULL,
  ind int(11) DEFAULT NULL,
  s1t double DEFAULT NULL,
  cpm int(11) DEFAULT NULL,
  c int(11) DEFAULT NULL,
  remote_ip varchar(45) DEFAULT NULL,
  forwardfor_ip varchar(45) DEFAULT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
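To illustrate the kind of grouping described above, a minimal sketch of a 10-minute average view over tab_values; which columns are worth averaging is an assumption:

CREATE VIEW view_10min_avg AS
SELECT
  uradid,
  FLOOR(timestamp_unix / 600) * 600 AS bucket_start,  -- start of each 10-minute bucket
  AVG(cpm) AS avg_cpm,
  AVG(s1t) AS avg_s1t
FROM tab_values
GROUP BY uradid, bucket_start;

Selecting a timespan then becomes a simple WHERE on uradid and bucket_start, although MySQL still scans the underlying raw rows each time the view is queried.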
December 9, 2014 at 3:31 pm #769 Hexide (Participant)
Hello,
Data aggregation will be pretty much mandatory as the number of devices increases.
But using views to get data for an arbitrary time period from the main data table (which contains the raw data) is extremely CPU and I/O intensive. For example, requesting 12 months of data for a specific device still requires MySQL to go through all 525,600 (60 x 24 x 365) records for that device and calculate the averages for each month. That is 43,800 times more rows to read (excluding the time it takes to compute the averages) than simply selecting 12 aggregated rows from a table containing monthly averages/minimums/maximums. That table would be filled at the end of every month by a cron script.
December 9, 2014 at 4:49 pm #773 uRADMonitor (Keymaster)
A cron script to compute the average per month, run once per month?
Probably this can be taken further: the cron script would compute the average for the last month, and move all of the previous month's rows to an "archive" database.
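Something along those lines could be as simple as the following sketch; the table and column names (monthly_avg, uradmonitor_archive, device_id, a unix-time column called time) are assumptions, and the dates are hard-coded only for illustration:

-- 1) Aggregate last month's raw rows into one row per device.
INSERT INTO monthly_avg (device_id, month_start, avg_cpm, min_cpm, max_cpm)
SELECT device_id,
       UNIX_TIMESTAMP('2014-11-01'),
       AVG(cpm), MIN(cpm), MAX(cpm)
FROM uradmonitor
WHERE time >= UNIX_TIMESTAMP('2014-11-01')
  AND time <  UNIX_TIMESTAMP('2014-12-01')
GROUP BY device_id;

-- 2) Move the aggregated period into the archive table, then remove it from the hot table.
INSERT INTO uradmonitor_archive
  SELECT * FROM uradmonitor WHERE time < UNIX_TIMESTAMP('2014-12-01');
DELETE FROM uradmonitor WHERE time < UNIX_TIMESTAMP('2014-12-01');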
However, some people would like to "go back in time" and analyse the data: say there was a spike two months ago on a given station. One might want to check the data, including "zooming in" to the maximum-resolution data, to understand the phenomenon better, and then move to other stations nearby to check for the same thing.
I wonder how this scenario could be implemented efficiently.
One row per device per minute is constantly being added to the DB.
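One possible shape for such a drill-down query, assuming the raw data ends up split between a hot table and an archive table (the table names, column names and device ID below are hypothetical):

-- Full-resolution data for one device and one day, two months back.
SELECT device_id, time, cpm
FROM uradmonitor
WHERE device_id = '110000AB'
  AND time BETWEEN UNIX_TIMESTAMP('2014-10-05') AND UNIX_TIMESTAMP('2014-10-06')
UNION ALL
SELECT device_id, time, cpm
FROM uradmonitor_archive
WHERE device_id = '110000AB'
  AND time BETWEEN UNIX_TIMESTAMP('2014-10-05') AND UNIX_TIMESTAMP('2014-10-06')
ORDER BY time;

It is slower than hitting the hot table alone, but it still touches only one device and one day of rows, which fits the "slower queries, but on ALL the data" requirement.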
December 9, 2014 at 5:15 pm #775 Hexide (Participant)
> A cron script to compute the average per month, run once per month?
Yes, it could be. However, if there are many devices, the job will generate a high load on the server. There are a couple of ways to prevent this:
1. Calculate the averages in intervals; the data takes longer to generate, but it works well.
2. Load an entire month's worth of data into a memory table and query it from there (see the sketch after this list).
3. Load an entire month's worth of data into a script variable and use it within the script.
4. Have a read-only MySQL replica and use it to query the data.
Once the averages are collected, the raw data for the previous month can be moved into an archive table (MySQL has a special ARCHIVE storage engine for this), but the downside of that engine is that it allows no indexes beyond the AUTO_INCREMENT column, so WHERE lookups require full scans. So a regular table should be used for archiving.
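For option 2, a minimal sketch of staging a month in a memory table; the table and column names are assumptions, and max_heap_table_size limits how large such a table may grow:

-- Stage last month's rows in RAM for fast, repeated aggregation
-- (dates are hard-coded here for illustration; a real job would compute them).
CREATE TABLE tmp_month ENGINE=MEMORY AS
  SELECT device_id, time, cpm
  FROM uradmonitor
  WHERE time >= UNIX_TIMESTAMP('2014-11-01')
    AND time <  UNIX_TIMESTAMP('2014-12-01');

-- ... run the aggregation queries against tmp_month instead of the raw table ...

DROP TABLE tmp_month;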
When displaying the full historical graph, we can cache all the data for at least a few hours, so the database is only queried every few hours for a single device.
December 9, 2014 at 5:41 pm #777 uRADMonitor (Keymaster)
Yes, this makes sense. Regarding 2), what would be the memory requirements on the Linux box?
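(A rough back-of-the-envelope estimate, using the figures quoted earlier in this thread and assuming roughly 100 bytes per row once stored in the MEMORY engine: 100 devices x 60 minutes x 24 hours x 31 days ≈ 4.5 million rows per month, i.e. on the order of 450 MB of RAM, so max_heap_table_size would have to be raised accordingly.)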
December 9, 2014 at 6:03 pm #787 Andy (Participant)
Check out RRDtool, it was made for this sort of thing. It's a very simple and robust tool.
If you want a circular buffer with SQL support, then MySQL really isn’t the tool. PostgreSQL would be your best option.
December 9, 2014 at 6:11 pm #789 Hexide (Participant)
A problem with RRD might be that it is a circular buffer, which drops resolution in order to preserve a longer history. Of course most of this is configurable, but it is not exactly the right tool if we want the entire device history at full resolution.
Edit: Also, mixing database engines for different kinds of data storage is difficult and really time-consuming.
December 9, 2014 at 6:23 pm #795 Andy (Participant)
I guess that is the question: do you need 1-minute resolution forever?
I wouldn’t see any issues storing data in RRD and configuration/current status in MySQL.
December 10, 2014 at 12:33 am #818 TSL (Participant)
Guys, you need to look at this: https://code.google.com/p/tungsten-replicator/
This enables multi-master replication and/or moving/syncing data to other DBs.
Is this data moving towards a "Big Data" moment? Maybe it should be in Hadoop?
regards
Tim
December 10, 2014 at 12:40 am #820 Ally (Participant)
I was thinking of mentioning Hadoop/MapReduce or Cassandra, but these technologies are a big change from Radu's current implementation, and the learning curve is steep.
Very cool technology, though. At my work I design and maintain a 500-gigabyte SQL Server data warehouse with almost 400 tables in a denormalised star schema. I have been trying to get my bosses to look at Hadoop and MapReduce, but it's 'new' tech and therefore complex and scary 🙂
January 2, 2015 at 1:50 pm #1095 Fokko (Participant)
First of all, best wishes for 2015!
MySQL is capable of handling moderately sized databases, so at this point I would not recommend moving to Hadoop. One classic problem with MySQL is that its MyISAM engine locks the whole table on insert operations (InnoDB uses row-level locking instead). NoSQL databases avoid table locking too, but at the cost of the structure of the data, etc. Have a look here: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Did you tune the MySQL parameters? By giving MySQL some more memory and loosening some of its parameters, it is quite capable of handling a lot of data.
A configuration wizard can be found here: https://tools.percona.com/wizard. First, try the config on a test database.
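For illustration, the kind of settings such a wizard typically adjusts looks like the snippet below; the values are placeholders, not recommendations for this particular server:

[mysqld]
# The InnoDB buffer pool is usually the single most important setting;
# on a dedicated database server it is often set to 50-70% of RAM.
innodb_buffer_pool_size = 2G
# Larger redo logs smooth out write-heavy insert workloads.
innodb_log_file_size = 256M
# Trade a little durability for faster inserts: flush the log about once per second.
innodb_flush_log_at_trx_commit = 2
# Avoid double-buffering between InnoDB and the OS page cache.
innodb_flush_method = O_DIRECT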