[Summary: Thinking aloud about a pragmatic / humanist approach to data infrastructure building]
Stephen Abbott Pugh of Open Knowledge International has just blogged about the Open Data for Tax Justice ‘design sprint’ that took place in London on Monday and Tuesday. I took part in the first day and a half of the workshop, and found myself fairly at odds with the approach being taken, which focussed narrowly on the data-pipelines-based creation of a centralised dataset, and which appeared to create barriers rather than bridges between data and domain experts. Rather than rethink the approach, as I would argue is needed, the Open Knowledge write-up appears to show the Open Data for Tax Justice project heading further down this flawed path.
In this post, I’m offering an (I hope) constructive critique of the approach, trying to draw out some more general principles that might inform projects to create more participatory data infrastructures.
The context
As the OKI post relates:
“Country-by-country reporting (CBCR) is a transparency mechanism which requires multinational corporations to publish information about their economic activities in all of the countries where they operate. This includes information on the taxes they pay, the number of people they employ and the profits they report.”
Country by Country reporting has been a major ask of tax justice campaigners since the early 2000s, in order to address tax avoidance by multinational companies who shift their profits around the world through complex corporate structures and internal transfers. CBCR got a major boost in 2013 with the launch of reporting requirements for EU banks to publicly disclose Country by Country reports under the CRD IV regulations. In the extractives sector, campaigners have also secured regulations requiring disclosure of tax and licensing payments to government on a project-by-project basis.
Although in the case of UK extractives firms reporting is taking place to Companies House as structured data, with an API available to access reports, for EU banks reporting is predominantly in the form of tables at the back of PDF-format company reports.
If campaigners are successful, public reporting will be extended to all EU multinationals, holding out the prospect of up to 6,000 more annual reports that can provide a breakdown of turnover, profit, tax and employees country by country. If the templates for disclosure are based on existing OECD models for private exchange between tax authorities, the data may also include information on the different legal entities that make up a corporate group, important for public understanding of the structure of the corporate world.
Earlier this year, a report from Alex Cobham, Jonathan Gray and Richard Murphy set out a number of use-cases for such data, making the case that “a global public database on the tax contributions and economic activities of multinational companies” would be an asset for a wide range of users, from journalists and civil society to investors.
Sprinting with a data-pipelines hammer
This week’s design sprint focussed particularly on ‘data extraction’, developing a set of data pipeline scripts and processes that involve downloading a report PDF, marking up the tables where Country by Country data is stored, describing what each column contains using YAML, and then committing this to GitHub where the process can then be replicably run using datapipeline commands. Then, with the data extracted, it can be loaded into an SQL database, and explored by writing queries or building simple charts. It’s a technically advanced approach, and great for ensuring replicability of data extraction.
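To make the shape of that pipeline concrete, the general pattern looks something like the sketch below. This is my own illustration of the pattern, not the sprint’s actual code: the file name, column mapping and database schema are all hypothetical.

```python
# Hypothetical sketch only -- not the sprint's actual pipeline code.
# The file name, column mapping and table name are all illustrative.
import sqlite3

import tabula  # tabula-py: extracts tables from PDFs (requires Java)

# In the real pipeline this mapping is declared per-report in YAML;
# here it is inlined for illustration.
COLUMN_MAP = {
    "Country": "country",
    "Turnover (€m)": "turnover",
    "Profit before tax (€m)": "pre_tax_profit",
    "Tax paid (€m)": "tax_paid",
    "Employees": "employees",
}

# 1. Extract the Country by Country table from the report PDF.
tables = tabula.read_pdf("bank_annual_report.pdf", pages="all",
                         multiple_tables=True)
cbcr = tables[0].rename(columns=COLUMN_MAP)

# 2. Load the normalised rows into a SQL database for querying.
with sqlite3.connect("cbcr.db") as conn:
    cbcr.to_sql("cbcr_disclosures", conn, if_exists="append", index=False)
```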
But it’s also an approach that ultimately misses the point entirely, ignoring the social process of data production, creating technical barriers instead of empowering contributors and users, and offering nothing for campaigners who want to ensure that better data is produced ‘at source’ by companies.
Whilst the OKI blog post reports that “The Open Data for Tax Justice network team are now exploring opportunities for collaborations to collect and process all available CRD IV data via the pipeline and tools developed during our sprint”, I want to argue for a refocussed approach, based around a much closer look at the social dynamics of data creation and use.
An alternative approach: crafting collaborations
I’ve tried below to unpack a number of principles that might guide that alternative approach:
Principle 1: Letting people use their own tools
Any approach that involves downloading, installing, signing up to, configuring or learning new software in order to create or use data is likely to exclude a large community of potential users. If the data you are dealing with is tabular: focus on spreadsheets.
More technical users can transform data into database formats when the questions they want to answer require the additional power that brings, but it is better if the starting workflow is configured to be accessible to the largest number of likely users.
Back in October I put together a rough prototype of a Google Spreadsheets-based transcription tool for Country by Country reports that needed just copy-and-paste of data, and a few selections from validated drop-down lists, to go from PDFs to normalised data – allowing a large user community to engage directly with the data, with almost zero learning curve.
The only tool this approach needs to introduce is something like Tabula or PDFTables to convert from PDF to Excel or CSV: but in this workflow the data comes right back to the user to work with after it has been converted, rather than being taken away from them into a longer processing pipeline. Plus, it brings the benefit of raising awareness of data extraction from PDFs that the user can adopt for other projects in future, and of allowing the user to work around failed conversions using a manual transcription approach if they need to.
(Sidenote: from discussions, I understand that one of the reasons the OKI team made their technical choice was from envisaging the primary users as ‘non-experts’ who would engage in crowdsourcing transcriptions of PDF reports. I think this is both highly optimistic and based on a flawed analysis: the crowdsourcing task is relatively small in scale, at a few thousand reports a year, and there are potential benefits to involving a more engaged group of contributors in creating a civil society database.)
Principle 2: Aim for instant empowerment
One of the striking things about Country by Country reporting data is how simple it ultimately is. The CRD IV disclosures contain just a handful of measures (turnover, pre-tax profits, tax paid, number of employees), a few dimensions (company name, country, year), and a range of annotations in footnotes or explanations. The analysis that can be done with this data is similarly simple – yet also very powerful. Being able to go from a PDF table of data to a quick view of the ratios between turnover and tax, or profit and employees, for a country can quickly highlight areas to investigate for profit-shifting and tax-avoidance behaviour.
Calculating these ratios is possible almost as soon as you have data in spreadsheet form. In fact, a well-set-up template could calculate them directly, or a user with a basic ability to write formulae could fill in the columns they need.
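For example, with some made-up figures (purely illustrative, not drawn from any real report), the core ratios are nothing more than simple division – exactly what a spreadsheet formula, or a few lines of script, can do:

```python
# Purely illustrative figures for one hypothetical firm-country-year row.
row = {"country": "XX", "turnover": 500.0, "pre_tax_profit": 400.0,
       "tax_paid": 8.0, "employees": 20}

effective_tax_rate = row["tax_paid"] / row["pre_tax_profit"]    # 0.02, i.e. 2%
profit_per_employee = row["pre_tax_profit"] / row["employees"]  # 20.0 per head
tax_to_turnover = row["tax_paid"] / row["turnover"]             # 0.016

# A very high profit-per-employee figure in a low-tax jurisdiction is the
# kind of outlier an investigator would flag for closer inspection.
print(effective_tax_rate, profit_per_employee, tax_to_turnover)
```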
Many of the use-cases for Country by Country reports are based not on aggregation across hundreds of firms, but on simply understanding the behaviour of one or two firms. Investigators and researchers often have firms they are particularly interested in, and where the combination of simple data, and their contextual knowledge, can go a long way.
Principle 3: Don’t drop context
On the topic of context: all those footnotes and explanations in company reports are an important part of the data. They might not be computable, or easy to query against, but in the data explorations that took place on Monday and Tuesday I was struck by how much the tax justice experts were relying not only on the numerical figures to find stories, but also on the explanations and other annotations from reports.
The data pipelines approach dropped these annotations (and indeed dropped anything that didn’t fit into its schema). An alternative approach would work from the principle that, as far as possible, nothing of the source should be thrown away – and that structure should be layered on top of the messy reality of accounting judgements and decisions.
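As a sketch of what ‘layering structure on top’ could mean in practice – the field names here are my own illustration, not a proposed schema – each value would carry its source context with it:

```python
# Illustrative only: a disclosed value that carries its source context,
# rather than dropping footnotes that don't fit a narrow schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DisclosedValue:
    measure: str                  # e.g. "tax_paid"
    value: Optional[float]        # None where the report gives no figure
    source_text: str              # the cell exactly as printed in the PDF
    footnotes: list = field(default_factory=list)  # attached annotations

tax_paid = DisclosedValue(
    measure="tax_paid",
    value=12.5,
    source_text="12.5*",
    footnotes=["* Includes a one-off settlement relating to prior years."],
)
```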
Principle 4: Data making is meaning-making
A lot of the analysis of Country by Country reporting data is about looking for outliers. But data outliers and data errors can look pretty similar. Instead of trying to separate the processes of data preparation and analysis, the two need to be brought closer together.
Creating a shared database of tax disclosures will involve not only processes of data extraction, but also processes of validation and quality control. It will require incentives for contributors, and will require attention to building a community of users.
Some of the current structured data available from Country by Country reports has been transcribed by University students as part of their classes – where data was created as a starting point for a close feedback loop of data analysis. The idea of ‘frictionless data’ makes sense when it comes to getting a list of currency codes, but when it comes to understanding accounts, some ‘friction’ of social process can go a long way to getting reliable data, and building a community of practice who understand the data in more depth.
Principle 5: Standards support distributed collaboration
One of the difficulties in using the data mentioned above, prepared by a group of students, was that it had been transcribed and structured to solve the particular analytical problem of the class, and not against any shared standard for identifying countries, companies or the measures being transcribed.
The absence of agreement on key issues such as codelists for tax jurisdictions, company identifiers, codes and definitions of measures, and how to handle annotations and missing data means that the data that is generated by different researchers, or even different regulatory regimes, is not comparable, and can’t be easily combined.
The data pipelines approach is based on rendering data comparable through a centralised infrastructure. In my experience, such approaches are brittle, particularly in the context of voluntary collaboration, and they tend to create bottlenecks for data sharing and innovation. By contrast, an approach based on building light-weight standards can support a much more distributed collaboration approach – in which different groups can focus first on the data that is of most interest to them (for example, national journalists focussing on the tax record of the top-10 companies in their jurisdiction), easily contributing data to a common pool later when their incentives are aligned.
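To give a flavour of what such a light-weight standard might pin down – the field names and codelists below are assumptions for illustration, not an agreed specification – even a minimal shared record format with a validation routine would let distributed groups pool their data later:

```python
# Illustrative sketch of a light-weight record standard: agreed codelists
# (e.g. ISO 3166-1 alpha-2 for jurisdictions) and declared measures, so
# that independently transcribed data can be pooled later. All field
# names here are assumptions, not an agreed specification.
REQUIRED_FIELDS = {"company_id", "jurisdiction", "year", "measure", "value"}
MEASURES = {"turnover", "pre_tax_profit", "tax_paid", "employees"}

def validate_record(record):
    """Return a list of problems with a contributed record (empty if ok)."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append("missing fields: %s" % sorted(missing))
    if record.get("measure") not in MEASURES:
        problems.append("unknown measure: %s" % record.get("measure"))
    if len(str(record.get("jurisdiction", ""))) != 2:
        problems.append("jurisdiction should be an ISO 3166-1 alpha-2 code")
    return problems

print(validate_record({"company_id": "LEI:1234", "jurisdiction": "GB",
                       "year": 2016, "measure": "tax_paid", "value": 4.2}))
```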
Campaigners also need to be armed with use-case-backed proposals for how disclosures should be structured, in order to push for the best quality disclosure regimes.
What’s the difference?
Depending on your viewpoint, the approach I’ve started to set out above might look more technically ‘messy’ – but I would argue it is more in-tune with the social realities of building a collaborative dataset of company tax disclosures.
Fundamentally (with the exception perhaps of standard maintenance, although that should be managed as a multi-stakeholder project long-term) – it is much more decentralised. This is in line with the approach in the Open Contracting Data Standard, where the Open Contracting Partnership have stuck well to their field-building aspirations, and where many of the most interesting data projects emerge organically at the edge of the network, only later feeding into cross-collaboration.
Even then, this sketch of an alternative technical approach is only part of the story in building a better data-foundation for action to address corporate tax avoidance. There will still be a lot of labour needed to create incentives, encourage co-operation, manage data quality, and build capacity to work with data. But better we engage with that labour than spend our efforts chasing after frictionless dreams of easily created perfect datasets.
Tim
I agree with you that this project, if it is to succeed, needs a more humanist approach – it isn’t all about the perfect dataset but about bringing people together to explore and test what meaning can be made out of the data. I gave somewhat similar feedback to the project back in December, based on their draft ‘user stories’ white paper, as did a senior tax expert.
However, after constructive engagement with Stephen and Jonathan, we received a series of rude and dismissive responses from one of the project team members. My motives and integrity were questioned and I was told that ‘the whole premise of your argument …is flawed from beginning to end’ and that ‘none of your comments make any sense at all’.
I raised with OKFN the point that, usually, when you issue an open request for comments, people who respond don’t expect personal abuse. But there was no resolution, and in the end I decided that it was not possible to be involved further with a project that did not see the need to set some basic interpersonal ground-rules.
My original comments to the white paper are below. I still believe this project could be useful, but it needs to address the social side of what it means to enable open, constructive participation.
______________________________________
COMMENTS ON USER STORIES FOR A GLOBAL PUBLIC DATABASE OF THE TAX CONTRIBUTIONS OF MULTINATIONAL COMPANIES.
The ‘user stories’ document is trying to do two things at once:
– Firstly, it is compiling design requirements for the database – i.e. easily accessible open data, ability to document where it comes from, include legal definitions used in the data, search and query by country, continent and corporate group etc… (i.e. ‘Open Data’).
– Secondly, it is setting out a wishlist of what people may want to do with the data in terms of analysis, i.e. “evaluate the overall misalignment of profits by MNC and by country and rank these accordingly”, compare profit per employee, undertake ratio analysis using the logic of unitary taxation to assess the likely scale of base erosion and profit shifting (i.e. ‘…for Tax Justice’).
The first question is relatively straightforward – people should be able to search and download commonly structured data without losing meta-information about where it comes from etc… As such, the database would provide a short-cut to searching through accounts and filings from multiple countries.
The second question is more difficult. It concerns the use and usefulness of the data – which remains an open question. It may be that the database project itself in the end does not seek to answer this question, but, similarly to OpenCorporates, focuses on the first step – organising information in the public domain to make it easier to find and use. People can do whatever they like with the information after that.
However, there is an increasing shift within the transparency and accountability movement towards seeking to articulate and demonstrate specific use cases and impacts of open data (e.g. Jonathan Fox, Nathaniel Heller, and Martin Tisne). As the debacle of the UNCTAD trade misinvoicing/illicit financial flows study this summer illustrated, what people would like to interpret from data is not the same as what they really can! Hence research institutions ought to have processes of peer review and quality control of their findings. Open data approaches do not have these kinds of quality controls, but they face the same challenges in getting towards open knowledge.
I think there is a need for a careful, open, analytical conversation about how CBCR data might meaningfully be interpreted. OKI can play a part in hosting such a conversation, but this would need a different approach than compiling a wishlist of uses (and should be separated out from the database design questions).
Specifically, several of the use cases in the ‘user stories’ document make the assumption that, by comparing CBCR data with a unitary formula, mismatches can be interpreted as a ‘misalignment’ and a sign of profit shifting (allowing the extent of misalignment to be ranked etc…). This is not obviously true – for example, the recent European Green Party report on Zara compiled country-by-country data on the company from corporate filings and drew strong conclusions about profit shifting, based on observing where the company has relatively higher and lower profits per employee. However, there is no good reason to think that there should be the same amount of profit and risk associated with one retail employee as with one person carrying out design and supply chain management across the whole group (similarly, for banking it is not clear that comparing profits per employee between different parts of the business is meaningful, given the different business models of retail branches and investment banking).
Practically my suggestions are:
– Avoid building quick-ratio visualisations (e.g. on profits per employee) and ranking tools into the database.
– In the next phase of development of the project, separate out the questions about database design from those about analysis – the agile design/user stories discussion is not a useful framework for considering whether analytical methods are robust and meaningful. One needs inputs from open data/database design experts; the other needs to engage with tax experts.
– OKI could play a valuable role in bringing together different players to consider the opportunities and limitations for interpreting and using public CBCR data (perhaps by looking at particular company case studies together – this does not depend on there being a huge number of companies to look at).
Hey Tim,
Since we had a good discussion during the workshop (and as one of the organisers of this workshop) I feel it would be good to continue this discussion here.
So firstly – as you know, the technical presentation in the workshop was not of a finished tool, but of a work in progress for review and discussion (as was clearly stated on multiple occasions). The cumbersome ingest process was presented with the disclaimer that we’re working on something much more user-friendly (which will not involve GitHub, Tabula or any copy-pasting into YAML files). We also invited comments and suggestions, which we took closely to heart – one example was the requirement to keep the context (such as annotations) during the process, which we will surely incorporate in future versions.
We do see our goal as collecting all the datasets in a predictable and reproducible manner. Although you underplay the importance of such a dataset, it’s my belief that the existence of a full dataset on tax data is much more important than any single analysis of any single company.
We also think that in order to meet this goal, we need to use automated tools and minimise human involvement in the process (and the expertise level required to get involved). You claim that this is not needed as this is a task of “small scale”. Yet, out of 1000 documents published each year, the manual processes (with all the experts and unpaid students working together) have so far managed to process the underwhelming number of 50 documents. To get to 1000 (or 6000 in the future) we’d need a twentyfold increase in the amount of work, and to sustain it year after year. Frankly, I don’t see that happening.
I honestly believe that we should engage people outside the small clique of tax experts and Master’s students. When doing crowdsourcing projects in Israel, we managed to engage a whole lot of concerned citizens with little domain knowledge but lots of motivation to fight injustice – and that’s from a population of less than 10 million citizens! I can only imagine what can be done with the 50 times more people living in the EU.
You claim that the individual researcher or journalist doesn’t need the large database for inspecting one company, and you’re right – they can probably extract the data they need from the specific two or three PDF documents in less than an hour’s work. But the catch is that they also won’t care about any standard and won’t bother to contribute it to any centralised repository – and that’s fine, as our aim was not only to save them that hour’s work. If and when there’s a good dataset available, it might be of use to these researchers – but they don’t lose much by not having it. On the other hand, by taking this approach of relying on individual contributions we will always be left with a partial database with questionable quality and origin, which eventually will be unusable.
[And I didn’t even mention the biggest problem which is not mentioned in your post, which is not collecting the data, but correctly normalising the different datasets and making them comparable – but that’s for a separate post].
The most important point, though, is that I feel your post demonstrates the same problem that afflicts so much of the open-data movement today. From a movement that started with very noble goals and dreams, it has become consumed with data formats and scraping techniques.
But the truth of the matter is that the end goal is creating social change – not perfect datasets or PDF scraping communities. I know that data publication quality is indeed a problem, but we should use the technology we have to eradicate it, and not prolong it by building around it communities, culture and traditions.
Best, Adam
Thanks Adam for engaging with this discussion.
My sense is that your response highlights that we have a fairly fundamental difference of opinion here.
You say
“we need to use automated tools and minimise human involvement in the process (and the expertise level required to get involved)”
I disagree. Human involvement is central to meaning-making with tax data. Good design can indeed reduce the amount of expertise needed to **start** getting involved, but it should support people to acquire and exercise increasing degrees of expertise over time.
You also appear not to be considering the questions of what might incentivise someone to get involved with contributing to a dataset, or how a potential user of the dataset might come to trust it. These human aspects of an open data project need to be fitted in with the technical design choices made. Past experience of successful crowdsourcing does not mean there is a pre-existing constituency for the next crowdsourcing project.
In a later section of your reply you also say:
“the catch is that they [researchers transcribing their own data] also won’t care about any standard and won’t bother to contribute it to any centralised repository – and that’s fine, as our aim was not only to save them that hour’s work. If and when there’s a good dataset available, it might be of use to these researchers”
On what basis? Why do you believe researchers won’t care about contributing to shared data when offered patterns and approaches to do so? Why would they then choose to use a centralised dataset they have not been engaged in constructing? My key contention is that the design sprint, and the design process in its fuller form, need to engage with questions of motivation, incentives and social infrastructure – not bracket these out as something to deal with later, or treat them on the basis of assumptions.
I don’t dispute that the long-term goal should be to extend the debate out beyond tax experts (and indeed, as a non-tax expert I’ve found the opportunity presented by engaging with OD4TJ to learn more about the structures of tax to be a positive one), but we should also not ignore existing communities of practice if we want to see substantive social change.
It may be that our biggest disagreement is ontological. You suggest that “correctly normalising the different datasets and making them comparable” is the biggest problem. I agree that comparability is a challenge – but this is not just a question of data: it is a question of the policy and definitions underlying its original production – and what I was hearing from the workshop suggested that it will take more than work on dataset design and normalisation to render tax disclosures fully comparable. Instead of a goal of ‘using technology to eradicate data publication quality problems’, my contention is that we have to design socio-technical approaches that engage with the messy reality of data production – and that empower people to use data as a tool of change.
Hi Maya. I’ve emailed you in response to your comment here. I do not know the context in which you felt personally abused for your comments last year, but I’d like to listen (I’m part of the leadership team at Open Knowledge International).
Paul (and also Pavel) – thank-you for getting back to me and making clear that the comments I received were completely out of line. I am glad that OKI is working on improving its terms of engagement so that its values and norms of constructive engagement are made more explicit to partners.
I have persevered with this not just because I think people should treat each other with basic respect and tolerance (they should!) but because the point of all of this is to drive learning. If gatekeepers prevent people with different views and knowledge from engaging together by making it personally unpleasant, and undermining the legitimacy of those that do, then the process doesn’t generate learning, but actively prevents it.
Yesterday Transparency International published their output from the Sprint event – which illustrates the point. http://transparency.eu/open-data-for-tax-justice-project/
TI ran with the old campaigning claim that in 2009 Barclays Bank had an effective tax rate of 1%. This is simply false, and is easy to check, since the actual figure of 23.4% is clearly accessible in Barclays’ annual report. But data availability and accessibility is not the primary limitation here. The gap seems to be around broad feedback mechanisms – it is not that there aren’t peer feedback loops, it is that they are very tight – so everyone in the close network who discussed the case at the Sprint, looked at the article before it was published and shared it afterwards must have thought the 1% figure looked OK (…because no one else in the tight network had refuted it), while any tax experts who saw the article will have rolled their eyes and probably concluded that the whole effort was not carried out in good faith.
Going back to Tim’s original post – I think this raises strategic questions for OKI, which go beyond safeguarding basic norms and values – i.e. when is it useful to focus on cleaning or standardising data and building the technical capacity of civil society to manipulate data formats etc… and when is it more useful to grasp the nettle of convening difficult context-rich conversations to empower civil society to use data meaningfully.
Perhaps OKI sees its specialisation more as focusing on the first role, but in cases like this, this strategy seems likely to be ineffective – just making the data available is ‘pushing on a string’ if civil society does not have the capacity to use it meaningfully. And it is always possible to call for more data, even as misunderstandings and antagonism mount up.
I hope that on this topic OKI is able to play a positive role in enabling tax experts and civil society organisations to feedback constructively and learn together.
Hi Maya,
Thanks for commenting. As we discussed this week, I am indeed in the process of preparing a code of conduct that is explicit for any partners on our projects, so that they are aware of the basic terms of engagement and participation that we value, and with the goal of preventing interactions such as those that you and Iain Campbell were subject to during the call for comments on the white paper “What Do They Pay?” earlier this year.
(For full disclosure for anyone following this discussion, I am updating Maya directly by email on all the steps I am taking, after we agreed on such steps in a call last Monday.)
On the other issues you raise:
The blog post that Transparency International published is not something that OKI has control over, nor do we wish it to be. The fact that it was an output of the sprint does little to change that. I know you have also reached out to TI on Twitter about this. I can’t speak with more authority on this than they can.
On the point about Tim’s post and strategic questions for OKI. We will follow up by publishing on the OKI blog, but the key issue here is the extrapolation of one stream of activity during a 2 day event to posit an entire approach to a project. There are a number of problems with using that to frame a narrative, likewise with the binary framing (tech vs social, centralised vs decentralised).
In your comment you are asking if OKI is missing an opportunity to use data meaningfully if our focus is on cleaning and standardising data. That is a great question in general, but also, in general, OKI does not only focus on cleaning and standardising data. While in the Open Data for Tax Justice project, it was an important stream of activity in the 2 day sprint last week, it has in fact been a very small part of our overall activity, which was by and large research, culminating in the aforementioned white paper.
Finally, I too hope that OKI can play a positive role in enabling tax experts and civil society organisations to feedback constructively and learn together. I truly hope the steps that you and I have discussed for interaction around the project, which is currently via the mailing list, can be an important step in that direction.
Thanks Paul for the updates here.
Just to note and clarify before any wider response you may be preparing: my argument is not about binaries. Nor is the above post, which talks about being ‘more decentralised’ (a graduated question of where emphasis and balance in design goes), not a simple choice between centralised or not. Equally, central to my point is that design needs to be socio-technical: to understand how the social and the technical interplay – but also to recognise that in this particular case, the unresolved design challenges are not with extracting data from PDFs, but with engaging a community of practice around the production and consumption of that data. The ultimate product will of course need both technical and social design.
I’d also note that I was specifically referring to the approach taken in the workshop, which appeared, from the follow-up blog post, to be the approach being taken forward into next steps. I was not offering a critique of the whole project. Indeed, the user-stories approach taken in the white paper is a positive step for a data project, and could have been followed through on more in the workshop.
However, on the TI blog post, I will note that in the workshop participants were encouraged to write up what they had discovered – and issues of verifying information through journalistic practice were not discussed. I think this is something OKI as workshop host needs to reflect upon, in terms of recognising that skills in using information for advocacy involve more than just extracting findings from data, but should also involve various forms of validation. (On the substance of the TI post I have no specific view right now – I have not had the chance to fully review the arguments to come to one.)
Hi Tim,
Thanks. It is good to hear that the intention was to talk about the workshop, rather than the project at large. I think the subtleties of that might be lost with phrases like “Open Data for Tax Justice project heading further down this flawed path”, or even the principles (which, of course, we basically agree on), which definitely read to me as principles for running a socio-technical project, and not necessarily all principles that would be manifest in a 2 day sprint.
I think you are spot on in regards to the issue of journalistic practice. We have quite a bit of prior knowledge in this regard and on reflection it was missing from this sprint. I think this is something we can work on in future sprints like this one.
In any event, we intend to share some of our thoughts with you pre-publication, for your early feedback, and towards better collaboration on such projects in the future.
Thanks Paul & Tim
Also to clarify – I do not mean to suggest that OKI should have controlled or prevented TI’s blog post. Rather, the fact that TI’s analysis coming out of the workshop relied on a spurious number illustrates the problem with the approach (…so far…) of not bringing a wider group of people into the conversation. If someone had picked up on this number in the workshop, a discussion could have been sparked. Instead, Elena probably assumed it must be reasonably credible.
(I should write up and post on the substance of the issue with the 1% figure separately – I will post it on my blog.)
It’s never too late, and I am glad that OKI is engaging with these questions and challenges. I think it is fair to say that the project has constrained its engagement with tax experts to those within the tax justice movement (i.e. TJN, FTC, ICRICT and associated organisations). This is not an extrapolation of one stream of activity during a 2-day event, but is reflected, for example, in the list of people who provided inputs to the white paper http://datafortaxjustice.net/what-do-they-pay/#acknowledgements. The strong negative response of one of the white paper authors to my comments back in December was specifically a reaction to the idea of widening participation in the dialogue, and of seeking validation from anyone outside of the tax justice movement.
Hi Tim. A colleague pointed out that sharing our thoughts pre-publication would actually be against an open approach, and I agreed. Instead, we have published a blog post on building open databases that, while not a direct response to this post, does seek to outline the approach we have been employing, and which may be useful for readers: https://blog.okfn.org/2017/08/10/23170/