
The dawn of a new CT format at PhUSE CSS!?


Disclaimer: This is my personal view of the PhUSE conference with a focus on the work group in which I participated. The descriptions and impressions are how I interpreted them.

PhUSE is an organization working with standards, and specifically with the implementation of and tooling around standards. CDISC, which produces the standards, is an important partner in this work. The line between CDISC and PhUSE is, however, far from clear, and it is sometimes difficult to say where the responsibility of CDISC ends and that of PhUSE starts.

This was the 6th PhUSE CSS (Computational Sciences Symposium) meeting. The US meeting is always held in Silver Spring, outside Washington, so that FDA delegates can attend as easily as possible.

The meeting kicked off on Sunday afternoon in the Civic Centre in Silver Spring. Steven Wilson from FDA made it clear in his opening speech that this is not an ordinary conference where you can sit back and enjoy. At PhUSE CSS you “roll up your sleeves” and actively contribute in the various working groups. It should be a “crucible of thoughts, ideas and solutions and a place for dreamers and doers. All work should be done in a collaborative, open and transparent way and the result should be openly shared!” – In short, inspiring and exciting!

After the initial “stage setting” speech it was time to present the various working groups. This year five were presented:

  • Improve analysis (Tables, Figures and Listings) – Led by Mary Nilsson from Eli Lilly.

  • Optimizing use of data standards – Led by Jane Lozano from Eli Lilly.

  • Linked Data and Graph Databases – Led by Tim Williams from UCB.

  • Emerging trends and technology – Led by Geoff Low from Medidata.

  • The non-clinical topics – Led by Patricia Brundage from FDA.

The emerging trends and technologies group would clearly have been very interesting to participate in, but my focus for this meeting was the Linked Data and Graph Databases group. To some extent the two groups overlap, and the Linked Data group is an offshoot of the emerging trends and technologies group. The subject “alternative transport formats” discussed in the emerging trends group is clearly related to the Linked Data group, whose goal is to define a new standard for structuring and storing SDTM data in a “graph like”, multidimensional way that overcomes many of the shortcomings of the two-dimensional SAS v5 format used today.

After a short break with some drinks and snacks, the evening continued with a very well managed workshop on graph databases, led by Tim Williams. The workshop went very smoothly and covered both property graphs using Neo4j and RDF using Protégé. It was run on virtual servers in the cloud, one for each participant, with all necessary tools and data installed.

The second day (Monday) was opened by Crystal Allard from FDA. She spent much of her talk on statistics about submissions and how well they comply with the FDA validation rules. A surprisingly low share (only 55%) of the submissions were OK, which means that 45% were rejected. This figure clearly indicates that there is still a lot to be done in tool support for, and usage of, the available standards. The same problem, with approximately the same figures, was also presented by Elaine Thompson when she reported on the status of compliance in non-clinical data.

The more introductory talks by FDA staff were followed by a presentation by Mary Doi, also from FDA, on how analysis tools are used for NDA/BLA Clinical Review. It was quite interesting to get the reviewer's perspective, and once again it underlined the need for better standards, but first of all for better implementation of and compliance with the existing standards.

All in all, the morning talks laid a very good foundation for the work in the working groups that followed.

The working group on linked data and graph databases was introduced before lunch on Monday by Tim Williams and Scott Bahlavooni, D-Wise. As Scott put it, we are stuck in the traditional way of doing things, accepting the limitations and problems that come with “old age” standards and data formats (like the SAS transport format dating back to 1986). We must be more curious and explore what new tools and techniques can bring. Real change comes from the bottom, so we need to showcase the possibilities of new technologies and how they can solve day-to-day problems without disrupting the normal work processes. It is a challenging task, but sooner or later things must change.

The root problem that the Linked Data and Graph Database group is trying to address is that the current SDTM format, even though it was only ever meant as a data transfer format, tries to squeeze a multi-dimensional clinical trial world into a two-dimensional table format based on fixed-length character fields in ASCII files. The problem would perhaps not be so big were it not for the fact that this table format has, in many settings, also turned into a storage format, since following the format guarantees compliance with authority requirements. The fact that it is a very limiting and error-prone format does not seem to matter.

One approach that has been tried before is to define a new transfer model based on XML, mainly to overcome the fixed-length problem and to allow for Unicode, the lack of which is a great problem in, for example, Japan. It does improve the situation, but it does not solve the root problem mentioned above. We are still limited by the two-dimensional SDTM thinking of today.
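
To make the fixed-length and character-set limitations a bit more concrete, here is a minimal Python sketch. The function name to_fixed_ascii, the 20-byte field width and the sample terms are all assumptions made up for illustration; they are not taken from the SAS transport specification or from any CDISC document.

```python
def to_fixed_ascii(value: str, width: int) -> bytes:
    """Encode a value as ASCII, truncated or space-padded to exactly `width` bytes.

    Hypothetical helper: it only mimics the general behaviour of a
    fixed-length ASCII field, not any real transport file layout.
    """
    # Characters outside ASCII cannot be represented and become '?'.
    encoded = value.encode("ascii", errors="replace")
    # Anything beyond the field width is silently cut off.
    return encoded[:width].ljust(width, b" ")


FIELD_WIDTH = 20  # illustrative width only

verbatim_term = "Severe headache after second dose"
japanese_term = "頭痛"  # "headache" in Japanese

truncated = to_fixed_ascii(verbatim_term, FIELD_WIDTH)
garbled = to_fixed_ascii(japanese_term, FIELD_WIDTH)

print(truncated)
print(garbled)
```

In this sketch the long verbatim term loses its tail and the Japanese term loses its content entirely, which are the two symptoms an XML-based transfer model can relieve. The table shape itself stays the same.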

The approach should instead be to define a new model that better represents the reality of the clinical trial that we are trying to capture. We should look beyond the two-dimensional SDTM structure and not limit ourselves. However, we should as far as possible continue to use the CDISC-defined lookup values and concepts, since for the foreseeable future it will be necessary to generate the “old” SDTM structure from the new format.

One good candidate for the new standard is RDF. The CDISC implementation guide and the CDISC code lists are already available as RDF today, and this is also true for many of the NIH vocabularies used in clinical trials. What is still missing, though, is some of the important external ontologies. Armando Olivia, Semantica (formerly an FDA employee), specifically mentioned MedDRA and WHODrug. Contact has, however, been made with MSSO (which also had a representative in the group), so hopefully these will be available when they are needed. Work has also been done at the UMC in this area, and a copy of the draft version of a WHODrug ontology schema was shared.

The alternative to RDF that was proposed is the property graph database Neo4j with its graph query language Cypher. Cypher and Neo4j are not as standardized as RDF, though (RDF is a W3C standard). Neo4j is much simpler and easy to “play around with”, but it is not suitable as a transport format (yet). The transport format is an important aspect, since it is foreseen that a future Clinical Trial standard in RDF format could also serve as a transport format, simply transferring the entire study with accompanying stylesheets and SPARQL queries for listings and analysis!
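
As a rough illustration of the “graph like” idea, the sketch below turns one flat, SDTM-style adverse event row into subject-predicate-object triples and looks them up with a simple pattern match. It uses plain Python rather than a real RDF library or SPARQL engine. The predicate names, the subj: and ae: prefixes, the helper functions and the sample values are all invented for this example, and none of it is the model discussed in the working group.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# One flat row using SDTM-style variable names; the values are made up.
ae_row = {"USUBJID": "STUDY1-001", "AESEQ": 1, "AETERM": "Headache", "AESEV": "MILD"}


def row_to_triples(row):
    """Split a flat adverse event row into linked statements (hypothetical mapping)."""
    subject = f"subj:{row['USUBJID']}"
    event = f"ae:{row['USUBJID']}-{row['AESEQ']}"
    return [
        Triple(subject, "hasAdverseEvent", event),
        Triple(event, "term", row["AETERM"]),
        Triple(event, "severity", row["AESEV"]),
    ]


def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


triples = row_to_triples(ae_row)

# Follow the link from the subject to its event, then ask for the severity.
events = match(triples, subject="subj:STUDY1-001", predicate="hasAdverseEvent")
severity = match(triples, subject=events[0].obj, predicate="severity")
print(severity[0].obj)
```

The point of the sketch is only that relationships become explicit statements that can be followed from node to node, instead of being implied by column positions in a table.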

The working group did some exercises with Neo4j, playing around with different graph representations of SDTM data. The future work of the group will, however, probably build upon the RDF work already done, primarily by Armando Olivia and Tim Williams. Below is an early draft suggestion of the structure of such a "Clinical Trial Ontology".

In addition to the working groups there was a poster presentation session and a tutorial on “Code sharing Utilizing GitHub”. GitHub is used a great deal in the PhUSE working groups to share code and documents.

The conference continued until 4 pm on Tuesday, ending with a panel discussion with participants from FDA, CDISC and PhUSE.

