Improvements to CR e-Lab

We propose a meeting at FNAL to discuss the e-Labs (31 March to 2 April). While most of our concerns are about the Cosmic e-Lab, we may also discuss the other e-Labs for a limited time.

Our experience with phone and video conferences is limited to one- or two-hour events. Even at that length it is sometimes hard to maintain connections and keep all attendees focused. This meeting will take place over three days; we feel that such a duration is difficult with participants in different places.

We suggest the following attendees: Bob, Liz, Edit, Tom. Ken might also wish to come for the Cosmic discussion. We can set up a two-hour conference call with Tom McCauley and Dale Ingram near the end of our time.

Discussing these issues will help us decide how best to spend our limited resources. We have been squashing bugs and putting out fires for too long. It's time to look at the entire system critically and decide what we can afford to address and what we have to leave as it is.

JUST IN: We might need to take into account that we will be working with Drupal....

Abbreviated List (March 26)

Watch standing
   Who is watching the servers to be sure that everything is up? Whom can they call if something goes astray?

Daily server observation
   What does Edit do in the background to keep an eye on the servers? Is Nagios enough?

Rollout and Rollback instructions
   Do we have clear instructions about how to do these?

Something bad is happening
   What do we do when things go astray? When is it OK to restart the server?

Drupal accounts
   Who will approve these new accounts as they come in?

Remains from original list
  • Concurrency testing
  • Load balancing (priority queue)
  • Data uploads
  • Remove bottlenecks in searching
  • Bugs
  • Feature requests
 

Original Long List

Topics for discussion include:

Performance of the Cosmic e-Lab

The e-Lab is showing its age. It falls over too often in workshop and classroom settings, and we need to explore ways to address this. We have just rolled out a big change in analysis workflows that should help with performance issues. We plan to test this recent update, but we also need to discuss other areas where we can improve performance. These include:

  • re-plotting
    • We currently re-run an analysis if the user wants to do something as simple as re-scaling the axes. This is a waste of compute cycles. The blessing plots already allow re-scaling the axes just by clicking on the plot. We should add that feature to plots from the analyses as well.
      • EP: This could take some time because we need to change the graphics tool from SVG to Flot. That involves changing the Perl code, possibly the Swift scripts, and new coding for Flot, JSON and Gson. (A Gson sketch follows this list.)
  • Java vs. Perl
    • All of the analysis workflows rely on aging Perl scripts. We moved ThresholdTimes to Java and realized a big performance boost. Is it worthwhile to consider re-coding the rest of the workflows in Java?
      • EP: That could be done, but it will take some time to convert from Perl to Java, plus testing to make sure that everything still works (including rounding; the sketch after this list shows one trap). We would also need to modify the Swift scripts to call Java rather than Perl (it might be easy; not sure).
  • Concurrency
    • The e-Lab gets very unresponsive when there are multiple users. This problem may have improved with the persistence of the .thresh files. We should still consider testing before the summer workshops start.
      • We might try running new stress tests with multiple users and see whether this is still a problem. The last couple of reports I received about the server "hanging" were due to other causes (still investigating, but not due to multiple users/analyses running).
  • Load-balancing of jobs
    • Is it possible to make a queue that will manage jobs in order to reduce pile-up?
      • EP: I am working on this and we will soon have a queue for testing :) (A sketch of one possible shape follows this list.)
  • problems with scalability: too many data files, plus a more complicated search with blessing, are bogging down searches (do we need to keep all of these files? Can we simplify what we search for?)
    • We need to discuss this in a telecon and come up with ideas by looking at the problem areas (cosmic search data).
  • review our infrastructure (VDC, physical files, etc.)
    • We probably need to involve other team members to discuss this issue (Mihael, Mike W.). 
  • Workshop use is limited because response time is so long when 15 students make simultaneous requests. Improvements in multiple-job submission, or a list of conditions that allow 10 simultaneous jobs, would be helpful.
    • We need to test this again.
  • Data uploads
    • Is there any way to improve the speed of data uploads? They take so long that something must be wrong on the server side. Tom sent mail about this on 19 Feb.
      • We need to find tools to test what causes the delay. Uploading from the user machine to the server can take some time depending on the size of the file, the internet connection (I guess), browsers (?), etc.
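
A minimal sketch of the Gson half of the re-plotting idea above: serialize the analysis points to JSON once, so a client-side library such as Flot can re-draw and re-scale without re-running the analysis. The class name and sample values are made up.

    import com.google.gson.Gson;

    import java.util.ArrayList;
    import java.util.List;

    public class PlotJson {
        // Flot consumes plain [[x, y], ...] arrays, so we emit exactly that shape.
        static String toFlotSeries(List<double[]> points) {
            return new Gson().toJson(points);
        }

        public static void main(String[] args) {
            List<double[]> flux = new ArrayList<double[]>();
            flux.add(new double[] {0.0, 12.5});  // (time, rate) -- made-up values
            flux.add(new double[] {1.0, 11.8});
            // Prints [[0.0,12.5],[1.0,11.8]]; the browser keeps this JSON and
            // re-draws at any axis scale without another server round trip.
            System.out.println(toFlotSeries(flux));
        }
    }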
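
And one concrete instance of the rounding caveat in the Java-vs-Perl item: for exactly representable ties, Perl's sprintf inherits the C library's round-half-to-even behavior, while Java's String.format rounds half up, so a straight port can differ in the last printed digit. The Perl outputs in the comments are what glibc typically gives; worth re-checking on our servers.

    public class RoundingCheck {
        public static void main(String[] args) {
            // Perl: sprintf("%.1f", 0.25) typically prints 0.2 (half to even).
            System.out.println(String.format("%.1f", 0.25));  // Java prints 0.3
            // Perl: sprintf("%.0f", 2.5) typically prints 2.
            System.out.println(String.format("%.0f", 2.5));   // Java prints 3
        }
    }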
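
For the load-balancing item, a sketch of one possible shape for the queue: a fixed pool of worker threads draining a PriorityBlockingQueue, so a workshop burst piles up in the queue rather than in Tomcat. All names and the two-worker cap are hypothetical, not the actual design.

    import java.util.concurrent.PriorityBlockingQueue;

    public class JobQueue {
        // Hypothetical analysis job; a lower priority value runs first.
        static class AnalysisJob implements Comparable<AnalysisJob> {
            final int priority;
            final String name;

            AnalysisJob(int priority, String name) {
                this.priority = priority;
                this.name = name;
            }

            public int compareTo(AnalysisJob other) {
                return Integer.compare(priority, other.priority);
            }

            void run() {
                System.out.println("running " + name);  // stand-in for the real workflow call
            }
        }

        public static void main(String[] args) throws InterruptedException {
            final PriorityBlockingQueue<AnalysisJob> queue =
                new PriorityBlockingQueue<AnalysisJob>();

            // Two workers cap concurrency no matter how many jobs arrive.
            for (int i = 0; i < 2; i++) {
                Thread worker = new Thread(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                queue.take().run();  // blocks until a job is queued
                            }
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                });
                worker.setDaemon(true);  // let the JVM exit in this demo
                worker.start();
            }

            queue.add(new AnalysisJob(5, "student flux study"));
            queue.add(new AnalysisJob(1, "interactive re-plot"));  // queued ahead of lower-priority work
            Thread.sleep(500);  // give the demo workers a moment to drain the queue
        }
    }
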
Look and feel
 
The e-Lab's user interface (UI) is also showing its age. There may be easily accessible (i.e., free or cheap) ways to update the current UI codebase to something that looks and feels like 2014. There are also specific interface issues beyond this big picture. These include:
 
  • Shower interface
    • When doing shower analyses, users should be able to look for nearby detectors. The current interface gives them no information about separation: users could select a detector at Fermilab and one in Japan. We've talked about using a map and a circle, but it doesn't have to be that fancy. (A separation sketch follows this list.)
      • This problem could be solved now without coding for a tool.
  • Can we use mashups to make a 2014 UI? Could this improve how we display logbooks, milestones and references?
    • Needs to be well thought out. Not difficult to implement, but time consuming because you have to work with the whole framework.
  • A better interface for the registration and management of users
    • This functionality needs a major rewrite. Not difficult, but time consuming.
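
On the shower-interface item above: even without a map, the separation check is one formula. A sketch of the great-circle (haversine) distance between two detectors; the coordinates and the cutoff are made up.

    public class DetectorSeparation {
        // Great-circle distance in km between two (lat, lon) points given in degrees.
        static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
            final double R = 6371.0;  // mean Earth radius, km
            double dLat = Math.toRadians(lat2 - lat1);
            double dLon = Math.toRadians(lon2 - lon1);
            double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                     + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                       * Math.sin(dLon / 2) * Math.sin(dLon / 2);
            return 2 * R * Math.asin(Math.sqrt(a));
        }

        public static void main(String[] args) {
            // A detector near Fermilab vs. one in Japan: clearly not shower partners.
            double km = haversineKm(41.83, -88.26, 35.68, 139.69);
            System.out.printf("separation: %.0f km%n", km);
            if (km > 1.0) {  // hypothetical cutoff for a combined shower study
                System.out.println("warn the user: detectors too far apart");
            }
        }
    }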

Data mismatch

  • Location mismatch
    • If a teacher has an account at a school in Wisconsin and uploads data, the data look like they are in Wisconsin. If that teacher takes the detector somewhere else (e.g., the South Pole) and uploads data, the data still look like they are in Wisconsin.
      • Add documentation telling teachers to get new accounts if they move schools or want to take data far away from their school.
      • Getting a new account is how we currently handle this. It's a workaround and puts the burden on the user. The difficulty is that the location metadata for uploaded data comes from the teacher's login. The metadata could instead come from the geometry of the detector. This is a much bigger solution and harder to code; maybe it's "pie in the sky," but it solves, rather than works around, the problem. Handling location this way would also make it easier to change the shower interface. (A sketch follows.)
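
A sketch of that "bigger solution": prefer coordinates carried in the detector's geometry record and fall back to the account's school only when none exist. All class and field names here are hypothetical.

    public class UploadLocation {
        // Hypothetical location type; the real metadata lives with the upload.
        static class GeoLocation {
            final double lat, lon;
            GeoLocation(double lat, double lon) { this.lat = lat; this.lon = lon; }
        }

        // Use the detector's own GPS fix from the geometry file if present;
        // otherwise fall back to the school's location, as the e-Lab does today.
        static GeoLocation locate(GeoLocation detectorGps, GeoLocation school) {
            return detectorGps != null ? detectorGps : school;
        }

        public static void main(String[] args) {
            GeoLocation school = new GeoLocation(43.07, -89.40);  // Wisconsin school
            GeoLocation pole = new GeoLocation(-90.0, 0.0);       // where the DAQ really is
            GeoLocation used = locate(pole, school);
            System.out.println(used.lat + ", " + used.lon);       // -90.0, 0.0
        }
    }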

Keeping up with new versions of system level software

We need to keep on the web treadmill as our software stack ages.

  • Java 6 vs 7 / Tomcat 6 vs 7
    • Already working with this.
    • We need to upgrade. This includes setting up the environment and making sure our existing code migrates cleanly.

Unit Testing

  • The webapp has lacked unit testing: code used to verify that each part of a program works correctly. These tests are written at the same time as the code, to prevent buggy code up front and, later, to make sure the code still works through updates, new additions, etc. We have started adding these, but it is still a work in progress.
  • Also, we need to add some automated testing: deployment of the application has presented problems a few times. It seems that there is a thread race at some point and the compilation of some classes does not complete. A new deployment usually fixes this, but we need automated tools to tell us whether a rollout is 100% good instead of manually testing the website. (A sketch of both kinds of test follows this list.)
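
A minimal sketch of both kinds of test using JUnit 4. The analysis helper is a stand-in defined here so the sketch is self-contained (real tests would target classes like ThresholdTimes), and the URL is a placeholder for the real deployment.

    import static org.junit.Assert.assertEquals;

    import java.net.HttpURLConnection;
    import java.net.URL;

    import org.junit.Test;

    public class ELabTests {
        // Stand-in for an analysis helper, defined here so the sketch compiles.
        static double mean(double[] xs) {
            double sum = 0.0;
            for (double x : xs) sum += x;
            return sum / xs.length;
        }

        // Unit test: pins the helper down so a Perl-to-Java port or a refactor
        // cannot silently change its output (rounding included).
        @Test
        public void meanOfKnownValues() {
            assertEquals(0.2, mean(new double[] {0.1, 0.2, 0.3}), 1e-12);
        }

        // Smoke test: run after each rollout; fails loudly if the site does not
        // answer 200 OK, instead of someone clicking through pages by hand.
        @Test
        public void siteAnswersAfterDeploy() throws Exception {
            URL url = new URL("https://www.i2u2.org/elab/cosmic/");  // placeholder URL
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            assertEquals(200, conn.getResponseCode());
        }
    }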

Better Administrative Tools

  • More monitoring tools, etc.
    • Replicate command line ways of retrieving information from the server
    • Make sure that the statistics we are gathering are correct (e.g., number of pretests)
    • A tool like NGOP at Fermilab that checks whether the site is up and emails or telephones someone if it is down. Argonne must have something like this. (A polling sketch follows this list.)
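
A bare-bones sketch of that NGOP-style check: poll the site and hand anything other than a 200 to whatever notifier we choose. The URL and the notifier are placeholders; cron or a Nagios check command would run this every few minutes.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class UptimeCheck {
        public static void main(String[] args) {
            String site = "https://www.i2u2.org/elab/cosmic/";  // placeholder URL
            try {
                HttpURLConnection conn =
                    (HttpURLConnection) new URL(site).openConnection();
                conn.setConnectTimeout(10000);
                conn.setReadTimeout(10000);
                int code = conn.getResponseCode();
                if (code != 200) {
                    alert(site + " answered HTTP " + code);
                }
            } catch (Exception e) {
                alert(site + " unreachable: " + e.getMessage());
            }
        }

        // Placeholder: wire this to email or a paging service for the watch stander.
        static void alert(String message) {
            System.err.println("ALERT: " + message);
        }
    }
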
Documentation for Project
 
  • What can we use to store all the information about the system?
  • University of Chicago Developer's wiki vs Drupal.
  • Do we need more documentation on the analyses and how they work?
Scaffolding for Users
 
  • Should we update our screencasts or add more? (e.g., screencast of data blessing on upload)
  • Update the references associated with milestones.
Understanding our users and e-Lab usage
 
  • Inspect the Tomcat logs for feedback (analyses, CMS event display, interactions with the database). (A parsing sketch follows this list.)
  • Look at trends in the number of logbook entries, pre- and post-test entries, and other items in the e-Lab over time.
  • Use a tool like Google Analytics to see what the users are doing (logbook, milestones, references, etc.).
  • Generate a survey to learn what users (and non-users) are doing.
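
A sketch of the log-inspection idea: tally requests per e-Lab area from a Tomcat access log in common log format. The file name and the path bucketing are guesses at what we would actually want.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class AccessLogStats {
        public static void main(String[] args) throws Exception {
            // Common log format: ... "GET /elab/cosmic/... HTTP/1.1" 200 ...
            Pattern request = Pattern.compile("\"(?:GET|POST) (\\S+) HTTP");
            Map<String, Integer> hits = new HashMap<String, Integer>();

            BufferedReader in = new BufferedReader(new FileReader("access_log.2014-03-26"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    Matcher m = request.matcher(line);
                    if (!m.find()) continue;
                    // Bucket by the first two path segments, e.g. /elab/cosmic.
                    String[] parts = m.group(1).split("/");
                    String key = parts.length > 2 ? "/" + parts[1] + "/" + parts[2]
                                                  : m.group(1);
                    Integer n = hits.get(key);
                    hits.put(key, n == null ? 1 : n + 1);
                }
            } finally {
                in.close();
            }
            for (Map.Entry<String, Integer> e : hits.entrySet()) {
                System.out.println(e.getValue() + "\t" + e.getKey());
            }
        }
    }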

Marketing our e-Labs

  • How can we retain and increase our audience?
  • Make NextGen more prominent.

Examination of the original assumptions and goals of the project, how we have succeeded, and whether they are still relevant.

  • (e.g., sharing data especially for shower studies, providing analysis tools to schools)
  • Is a shower study using multi-school data practical with our tools?  Do we need more functionality and visualization?
  • Do we need the tight coupling between milestones and the logbook?

How does the Purdue Java tool influence what should be in the e-Lab?

What new technologies might replace what we have implemented?

  • Should we be replacing logbooks with some open-source solution?
  • Should we be using Cloud Computing?
  • Should we still be using Tomcat and Java Server Pages?

Getting rid of wiki dependency

Handling the glossary, etc., consistently across all e-Labs

Should we continue to try to support and develop the CMS and LIGO e-Labs?

    Yes. Next question.

Hardware Needs

Feature Requests

  • Allow the user to superimpose multiple data sets on one plot, e.g., barometric pressure vs. flux.
  • Histograms of times between counters "i" and "j" (6 possible 2-fold combinations). Applications in order of usefulness: speed of the muon; performance studies; calibration of relative counter timing for use in shower reconstruction and for estimates of in-time counter hits versus random backgrounds. (A sketch follows.)
  • Counter multiplicity and logic requirements for any analysis. Install the ability to require a specific set of counters, e.g., counters 1 and 3 but not counter 4, within a user-specified time window. This is critical in lifetime studies. If the relative times from the histogram suggestion above were available per event, that would also enhance muon lifetime measurements.
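
A sketch of the histogram request above: for one event's four counter times, emit the time difference for each of the six 2-fold combinations, ready to histogram per pair. The event layout and the nanosecond units are assumptions.

    public class CounterPairs {
        public static void main(String[] args) {
            // Hypothetical single event: hit times (ns) for counters 1-4.
            double[] t = {12.5, 14.0, 13.1, 19.8};

            // 4 counters give 4*3/2 = 6 two-fold combinations, as requested.
            for (int i = 0; i < t.length; i++) {
                for (int j = i + 1; j < t.length; j++) {
                    System.out.printf("counters %d-%d: dt = %.1f ns%n",
                                      i + 1, j + 1, t[j] - t[i]);
                }
            }
            // Filling a per-pair histogram with these dt values over many events
            // gives the muon-speed, performance, and timing-calibration plots.
        }
    }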