Querying Ethereum State with the Ethereum Data Warehouse

Using the Ethereum Data Warehouse to directly query State and solve data analysis problems in a new, simpler and more reliable way.

The Ethereum Data Warehouse (EDW) brings lots of new possibilities for blockchain data analysis. Its major and breakthrough feature is a detailed history of state and storage changes. It enables new types of queries that answer some basic questions in the simplest and most reliable way possible – by directly using variables of smart contracts. Let’s look at a few examples.

With the way that Ethereum storage is organized it is almost impossible to query storage directly. It is because memory locations are actually hashes of variable slots and keys. This is why so many rookies ask the same question over and over again: ‘How can I iterate over mappings (e.g. balanceOf of any ERC20 token)?’ and are so surprised to hear: ‘You cannot!’.

But you need it – so what are your options? Most people following advice similar to this suggestion https://stackoverflow.com/a/68863130, they build local databases or they use one of existing data providers like The Graph or Dune Analytics to query the history of events. If you are a blockchain data analyst you know this pattern well:  “I want a history of state so I have to aggregate historical events to recreate smart contract values at a given point in time”.

This is equivalent to you logging in to your bank to check your balance but you only see the history of your transactions, not your balance. Of course you can go through all transactions and calculate your current balance, but is it really the most efficient and reliable way?

With EDW you can use the full history of events and calls in case you are actually interested in the history of ACTIONS but you also have a history of smart contract variables in case you need the history of STATE. This totally changes the approach to some typical problems as shown in the four examples below.

Problem 1 - What is the full list of token holders at a point in time?

Traditional approach:

  1. Analyze the contract code to fully understand its logic.
  2. Identify all types of events that can be used to identify new holders (typically it is enough to use Transfer events but for example WETH may emit a Deposit event instead of Transfer so you have to use both of them).
  3. Make sure that new holders cannot be added without emitting any event (in such case you probably should also aggregate relevant internal calls).
  4. Select all identified events since the contract deployment and aggregate them to get a distinct list of holders.

New approach:

  1. Identify the variable that acts as the register of token holders and their balances (typically balanceOf).
  2. Query all distinct keys used in this mapping.

A sample query to get the full list of holders of DAI would be:

				
					select distinct substr(location, 3, 42) as holder
from edw.raw.storage_diffs 
where contract = '0x6b175474e89094c44da98b954eedeac495271d0f' and 
      location like '2[%' and status;

				
			

In the raw EDW layer we store memory locations as strings following the pattern:

				
					slot[key1].field1[key2].field2
				
			

where slot is a memory slot used for storage, key1 is an optional first mapping key, field1 is an index of structure field or 0, key2 is an optional second mapping key and field2 is an optional index of structure field or 0.

For DAI the contract address is 0x6b17...d0f and the holders register is defined as: mapping (address => uint) public balanceOf and it is stored in slot number 2. This means that in EDW relevant memory locations are for example 2[0x2faf...ad2].0 (slot number 2, holder address 0x2faf…ad2 and no structure used). Additionally we add status to the WHERE clause to only allow for successful storage changes.

This obviously requires some knowledge about a storage layout of a given contract but is quite easy to get and with the next decoded layer of EDW (to be released soon) it will be translated to human readable format and split into separate fields.

Problem 2 - What is the full status of a given contract?

Traditional approach:

1. Well… 😓

New approach:

1. Get the list of memory locations used by a contract (e.g. UNI2 pool) with a simple query:

				
					select distinct location
from edw.raw.storage_diffs 
where contract = '0xd0671e35d25d82ec6b75e3de43975c89968d1c91' and status
order by location;
				
			

2.  If you additionally want to get the snapshot of storage variables at a given point in time you should look for the latest value written to this location before or at a given block. This requires using window functions which enable many new constructs not available with a simple GROUP BY clause. For example to get the latest values written to memory before or at a given block you should use last_value function over a partition defined by each memory location with a proper ordering:

				
					select distinct location,
       last_value(curr_value) 
       over (partition by location order by block) as value
from edw.raw.storage_diffs 
where contract = '0xd0671e35d25d82ec6b75e3de43975c89968d1c91' and 
      block <= 14000000 and status
order by location;
				
			

Problem 3 - Which contracts are allowed to call user proxies in the OpenSea Registry?

Traditional approach:

  1. Analyze the contract code to fully understand its logic.
  2. Find out that you can grant and revoke access by calling a few different functions startGrantAuthentication , endGrantAuthenticationrevokeAuthentication).
  3. Realize that granting is actually a two step process with a delay period.
  4. Additionally realize that these functions do not emit any events.
  5. Write a non-trivial aggregation logic using the history of internal calls.

New approach:

  1. Identify the variable that acts as the register of allowed contracts – in this case it is mapping(address => bool) public contracts occupying slot number 4.
  2. Query the keys with their current mapping values.

Sample query:

				
					select distinct substr(location, 3, 42) as contract, 
       last_value(curr_value) 
       over (partition by location order by block) as allowance
from edw.raw.storage_diffs
where contract = '0xa5409ec958c83c3f309868babaca7c86dcb077c1' and 
      location like '4[%' and status;
				
			

Problem 4 - Which contracts store or use my address?

Traditional approach:

  1. Well… 😓

New approach:

  1. Simply look for all locations that use my address as a key and storage values that are equal to my address.

				
					select distinct contract, location
from edw.raw.storage_diffs 
where location like '%0x0035fc5208ef989c28d47e552e92b0c507d2b318%' or
      curr_value = '0x0035fc5208ef989c28d47e552e92b0c507d2b318';
				
			

What comes Next

The sample problems presented above are just the tip of the iceberg.

With the full history of state and storage diffs the Ethereum Data Warehouse creates endless new possibilities. It even provides the history of storage reads for use cases like e.g. oracle usage analysis.

With the next edition of EDW (coming soon) we will be adding a next decoded layer of data that will greatly simplify the queries by using variable names instead of slots and casting storage values to the proper types.

Stay tuned…

More articles