Skip to content

Multiple sockets seem to cause crashing. #7

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
MatthewMenze opened this issue May 10, 2020 · 16 comments
Open

Multiple sockets seem to cause crashing. #7

MatthewMenze opened this issue May 10, 2020 · 16 comments

Comments

@MatthewMenze
Copy link

The socket close handler reports the incorrect socket closing, and seems to be related so some sort of device crash.

I am using the LTE shield with a Metro M4 Airlift Express board, using the hardware serial interface.

  • My firmware starts up the device, then opens 1 UDP socket to periodically writes a small JSON to the server, this socket behaves as expected, and takes up socket 0 on the LTE module.

  • Next my firmware opens two additional TCP sockets, which each listen on a different port for periodic data packets, taking up sockets 1 & 2. These sockets are polled with the library function poll() on every program loop. They seem to read data correctly as I send out packets. However, upon the connection closing the library reports to the socket closed callback that socket 3 has closed. After a number of connection close events, it starts reporting socket 4, then 5, then 6. When it reports socket 5/6 it stops triggering the socket read callback. Then shortly after it seems to stop functioning entirely, and I get TCP errors on my transmitting device saying unable to connect. It seems the same socket value is given regardless of which port I send a packet to (IE, the socket reported to close is unrelated to which listening socket I hit)

  • This issue seems to be partially mitigated if I call socketClose() on the misreported socket. Although it will still crash with enough events.

Based on experimentation, I believe this is an issue of incorrectly tracking and handling socket events, causing resources to be incorrectly de-allocated as sockets are created and destroyed. Maybe this is related to the memory issues I see mentioned in other issues and commits?

@MatthewMenze
Copy link
Author

MatthewMenze commented May 10, 2020

After additional digging into the library source code, the issue appears to be due to how incoming URC's are processed in this library.

In the poll() function incoming serial data is parsed until the first newline character, then from there it is checked for various URCs, and processed accordingly.

When processing a socket read, the library then sends a command to read the indicated socket data, then waits for a response from the LTE module with the socket data. While waiting for that, if additional packets come in, and cause the LTE module to transmit another +UUSOLI URC message, it will get eaten by the loop in socketRead that is waiting for a response from the USORD command sent.

This will cause data to exist on the LTE socket without the library knowing it is there, effectively rendering that socket unusable. Eventually if this happens enough, all available sockets are filled with data, but the library doesn't know to request it, and the whole thing locks up.

I think there is a few pieces required to fix this.

  1. Polling needs to be buffered. Ideally you would have an interrupt service routine handle any time there was data available on the line, and then push that into a buffer for the poll() function to iterate through and handle any commands stored within.
  2. There needs to be logic that handles unexpected data in the socketRead() function, and does something reasonable. Maybe flush all sockets data may be sitting on? (If it's caught here, it is probably fragmented/to late)
  3. Waiting for a response for up to 1 second in the socketRead() is probably going to cause headaches, especially if you are not using an interrupt or similar to grab and buffer data...A lot of new events could come down the pipe in a second.

I'm working implementing some sort of work around for this, but I'm not super familiar with the library, and may need to ditch this LTE Module entirely for the sake of my clients project.

@MatthewMenze
Copy link
Author

I've modified the read and write code paths in the library quite a bit now, and resolved much of the issues. Once I am sure it is working right I will upload a updated copy of the library files for you to review. My changes are not ready to integrate into the project out of the box, as I am not familiar enough with the library to tell if I have broken other parts of it, and I have not fixed the GPS functionality as I am not using it.

The major changes I had to implement are:

  • The poll function now first reads in from a backlog buffer of events that were read while waiting for command responses, then it reads in from the RX buffer and tokenizes each event read, before looping each event through the parsing logic. Before the poll function would hand the entire read into the parsing function, allowing only one of each event parse type to happen during a poll cycle, and tossing the rest. After each event is parsed, the backlog of any additional events read during executing that event are added to the buffer as well.

  • In the function that flushes the RX buffer before writing a command, it now reads those into the backlog buffer to be processed by the next polling cycle, instead of simply dumping them.

  • In the function that writes a command and waits for a response, data read in before response is added to the backlog buffer.

  • In the function that waits for a response after issuing a socketWrite(), the loop also looks for a error message instead of just "OK\r\n". Previously it would instead of to time out, causing much latency, if "OK" was not sent. Read data is also added to the backlog buffer.

@nseidle
Copy link
Member

nseidle commented May 13, 2020

Thank you for all your work on this! Please keep it coming, I really appreciate it.

If you're able to get to a finishing point, a PR would be ideal. If you have any examples you'd like to add to the library we'd gladly accept them as well.

@ROSW6341
Copy link

This took a lot of time to resolve so I'm passing it along here. I have several R410 modules and found it impossible to connect to a TCP server. They would open the port but refused to connect. Running the ATI19 command in pass-thru mode returned L0.0.00.00.05.06.02.00. After a lot of searching I found the current version of the firmware is L0_0_00_00_05_08 which can be updated via the USB connector on the SparkFun LTE Shield. This version is suppose to correct the problem.

@MatthewMenze
Copy link
Author

I'm getting closer now, but still running into new issues I have found, I'll look at checking this repo out and making a pull request at some point...So far I have just been writing against the files directly:

  • In a few functions, the memory allocation for the response char* passed to the sendCommandWithResponse() function is insufficient if the device returns a error string, causing writes to unallocated memory. I've fixed some but not all.

  • It looks like the allocations for command char*s may not be sufficient in some cases either. I am still working on this to be sure, however.

  • The function socketWrite() does not do any parsing of the response, or error checking of how the command returned. If it doesn't get the expected value "@" returned to indicate the module is ready to accept data, it writes the data anyways. I think this could be causing issues. I am still working on it though.

My fixes will probably not work for the general case, for one I am not handling buffers being filled super gracefully, I just check to make sure I am not writing out of bounds. If a user is only sending data, they will eventually fill. (If they are not calling my poll function regularly to handle events)

Realistically, they probably should always call some sort of polling function, as that is how you would detect and process errors.

You may want to consider the structure of the sendCommandWithResponse() function as well. At a minimum I think it may need a "expected error" as a argument, but more realistically, I think it might be worth restructuring the function to just read until new data isn't coming on the pipe (I use a ~1ms delay from the last received character at the default baud rate, and will try to reduce that more), then rely on the calling function to parse and act accordingly. Maybe not though, that is just my thoughts coming in. It would probably make handling error responses cleaner, as you wouldn't need to guess the length coming in.

I am using a board with a fair bit of memory on hand, so I have just been reallocating the char*s to 128 characters, I also run both the RX and backlog buffers as 2056 characters. This might not be reasonable on other devices.

You may also consider setting up read and write interrupts that buffer all this, and then parsing those with a regularly called device service routine, it seems like that is how this device wants to work, but I could be wrong.

@MatthewMenze
Copy link
Author

@ROSW6341 thank you for the pointer! I contacted support at uBlox and they provided me the easyflash tool and updated firmware file. What did you need to do to connect via USB to the shield? So far I have just soldered the jumper required to use the USB interface, did you need to do anything else?

I notice my computer only recognizes it is connected if I hold the power button down on the shield.

@MatthewMenze
Copy link
Author

@nseidle I've created a pull request of my working-for-me batch of fixes. It will need some more work to integrate into something usable for everyone using this product. Please feel free to reach out to me with any questions.

@MatthewMenze
Copy link
Author

I've updated with a second commit code to listen and receive UDP packets. It's a bit rough, particularly in the parsing of the read data (I plan to clean that up).

The library as it was did not have any code for parsing the URCs related to reading UDP packets as specified in the ublox documentation. Although you still use USOLI to set up a listening socket, the URCs that come in from a received UDP packet are different. Additionally, while you can read UDP data with USORD, the uBox spec says it is preferred to us USORF for UDP data, so I implemented that as well.

@MatthewMenze
Copy link
Author

@nseidle just wanted to see if there has been any updates on this issue on your end? I seem to have it working decent now, but I have really tore up the codebase, even beyond the pull request I made a while back, and I'm not sure if it impacts the usability for things outside my specific use case.

@nseidle
Copy link
Member

nseidle commented Sep 18, 2020

Hi Matthew - Sorry for the radio silence. We had to put this product to the side for a bit. We have tentative plans for a SARA 5 based product. I plan to either fork or otherwise modify this repo to be more generic (just a ublox SARA library rather than a SparkFun LTE shield product library). If we can convince you to bear with us, we'd love to work with you.

@MatthewMenze
Copy link
Author

No worries. Feel free to reach out whenever. I've changed around a few more things since this pull request was made, but haven't had a chance to really polish anything. I am using this circuit in a proof of concept prototype, but the project is nearing the point where we will develop custom hardware and firmware so there hasn't been much oppurtunity to refine stuff beyond the proof of concept point.

I'd love to help out refining this product though, the SARA line is great, and I think there is a lot of room to make this product preform better. Especially with some of the integration options that seem to be on the horizon between the SARA units and uBlox ZED chips.

Is there a way to reach out to the products team at SparkFun? There is actually a couple projects at my company floating around we wanted to see about working with Sparkfun on. Not sure the best point of contact for that.

@nseidle
Copy link
Member

nseidle commented Sep 18, 2020

Neat! Start with me nathan @ and I'll get you connected with the right people.

@MatthewMenze
Copy link
Author

Awesome! I sent an email to the address listed on your github. Thank you!

@MatthewMenze
Copy link
Author

@nseidle has there been any updates getting the buffered polling working in this library? I'm still using my hackish fork of it, somewhat updated from this issue, but I don't trust it nearly as much as I would like.

I notice some buffered polling is implemented in your SARA R5 library, I am considering doing some hardware prototyping with the R5 boards you offer, but I'm just not sure what the state of it is. I'm between sticking with my hackish version or migrating.

Thanks!

@nseidle
Copy link
Member

nseidle commented Mar 5, 2021

We are dedicating a lot of resources towards the SARA R5 library these days. It is our focus from now on but we will support major issues in the R4 library as well. We haven't had the chance to address this issue yet. If you'd like hardware to test out the R5 I would be happy to provide it to you.

@MatthewMenze
Copy link
Author

Thank you for the update! I would actually be interesting in testing out the R5, it is our target chipset for production. I went with the R4 shield as it fit into a proof of concept prototype I was working on well physically.

I was actually looking at testing with one of these from your product catalog: https://www.sparkfun.com/products/17272 for a project a bit further down my pipeline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants