Querying Ethereum State with the Ethereum Data Warehouse

Imagine logging in to your bank to check your historic balance, but only seeing the history of your transactions, not your balance? You could go through all transactions and calculate your balance at any point in time, but it's hardly the best way to do it. This is how most analyses on Ethereum data work now.

The complexity of storage data is why so many crypto-novices ask the same question over and over again: “How can I iterate over hash maps (e.g. balanceOf of any ERC20 token)?” and are so surprised to hear: “You cannot!”.

For many use cases this is the only way to go. Let’s explore the options – you could build local databases or use one of existing data providers like The Graph or Dune Analytics to query the history of events. If you are a blockchain data analyst you know this pattern well: “I want a history of state so I have to aggregate historical events to recreate smart contract values at a given point in time”.

Ethereum Data Warehouse provides an easier way to do this. Its major and breakthrough feature is the detailed history of state and storage changes or state and storage diffs. It enables new types of queries that answer some basic questions in the simplest and most reliable way possible – by directly using variables of smart contracts.

Let’s see how we answer some frequent questions. 

Question 1 - What is the full list of token holders at a point in time?

This looks like a very simple question – who owns a given token?

The traditional events-based approach takes in all the transfers and looks for unique addresses at the end. Depending on the token, this can be a massive undertaking. This would be something like this:

  1. Analyze the contract source code to fully understand its logic. 
  2. Identify all types of events that can be used to identify new holders (typically it is enough to use Transfer events but, not always.  WETH for example can also emit a Deposit event instead of Transfer so you have to use both of them).
  3. Make sure that new holders cannot be added without emitting any event (in such case you probably should also aggregate relevant internal calls).
  4. Select all identified events since the contract deployment and aggregate them to get a distinct list of holders.

Now with TokenFlow’s decoded storage diffs this task becomes much easier:

  1. Identify the variable that acts as the register of token holders and their balances. In this case it’s the variable balanceOf, which is a mapping (address => uint)
  2. Query all distinct keys used in this mapping.

A sample query to get the full list of holders of DAI would be:

				
					select 
  distinct(hashmap_keys[0]) as token_holder
from 
  storage_diffs 
where 
  contract_address = '0x6b175474e89094c44da98b954eedeac495271d0f' 
  and variable = 'balanceOf' 
  and not reverted
				
			

The decoded data makes this task trivial! You can just inspect the storage variable names for any contract and then query them to get their values.

Question 2 - How can I see the total supply of a token in time?

This is even a simpler question than the previous one, but unfortunately it is not trivial using the traditional approach.

Assuming you preprocessed the events and calls data as before to get a list of transfers for a specific token, which can be a challenge of itself, you will then have to do the following:

  1. Understand that mints are USUALLY (not always) encoded as transfers from zero (0x0000…0)
  2. Understand that burns are USUALLY (not always) encoded as transfers to zero address
  3. Aggregate everyting

A sample query for SHIBA INU on our own TRANSFERS table:

				
					with daily_changes as (
   select date(block_timestamp) as date, 
       sum(iff(sender = '0x0000000000000000000000000000000000000000',amount, 
           iff(receiver = '0x0000000000000000000000000000000000000000', -amount, 0))) as daily_change
    from transfers
    where token_address='0x95ad61b0a150d79219dcf64e1e6cc01f0b64c4ce'
    group by date)
select distinct date, sum(daily_change) over (order by date) as balance
from daily_changes
order by date desc;
				
			

This only works if the specific token is only created or destroyed by mints and burns. If a token has some sort of yield or interest mechanism, this approach will not give a comprehensive overview! Sometimes, contracts can use any arbitrary address for mints and burns, ie. 0x00000…1, 0x00000…2, in addition to 0x0. To get a complete overview, you should check for all the possible “burn” addresses.

With the storage diff approach, this is as easy as reading the most recent value for the _totalSupply (in the case of the  SHIBA INU token) storage variable of the contract!

And this will ALWAYS give the correct answer provided the contract code increases or decreases total supply with every creation or destruction of a token unit as it should.

We will make use of the last_value window function to make things easier.

				
					select 
  distinct(date(block_timestamp)) as day, 
  (
  last_value(curr_value) over (
      partition by date(block_timestamp) 
      order by block_number, order_index)
  ):: INTEGER as balance 
from storage_diffs 
where 
  contract_address = '0x95ad61b0a150d79219dcf64e1e6cc01f0b64c4ce' 
  and variable = '_totalSupply' 
  and not reverted 
order by day desc
				
			

Question 3 - Which contracts store or use my address?

With the amount of hacks we saw in the past, that should be a basic security question, yet the traditional approach does not offer us any useful tooling here. 

Let’s use TokenFlow instead then – simply look for all variables that use my address as a key and storage values that are equal to my address. 

				
					select 
    distinct(contract_address) 
from storage_diffs
where 
    array_contains('0x0035fc5208ef989c28d47e552e92b0c507d2b318'::variant, hashmap_keys) 
    or curr_value='0x0035fc5208ef989c28d47e552e92b0c507d2b318'
				
			

Closing thoughts

When looking at the examples above, it’s quite obvious that with currently available tooling, there’s a huge chunk of Ethereum data that is entirely inaccessible to most data analysts.

TokenFlow’s EDW aims to change this paradigm so that:

  1. Entire history is accessible to everyone with basic SQL knowledge by providing a semantic, decoded layer on top of raw data
  2. Complex queries are 1) possible to execute and 2) cheaper/faster to run thanks to using storage history versus just events
  3. Most useful master datasets like dictionaries of contracts, tokens, pools, vaults, etc. are distilled for ease of use and standardisation of queries.

Excited to know more? Talk to us and let’s work together!

More articles